I've been stumped for quite a while trying to decide what the criteria really are for when one should use AIC vs BIC. Burnham and Anderson talk about it quite a bit, but they are such staunch AIC partisans that it took me a while to come around to their point of view. The main reason that I would have preferred BIC is that, if you look at the derivations, BIC approximates the log of the marginal likelihood for a large dataset with an uninformative prior, while AIC approximates the same thing — but with a very strong prior (see p. 212-213 of the book, or Kass and Raftery 1995). From this point of view, the BIC seems more sensible.
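For concreteness, both criteria are simple functions of the maximized log-likelihood, the number of estimated parameters k, and (for BIC) the sample size n. A minimal sketch (the log-likelihood value here is made up for illustration):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*logL."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logL."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fit: maximized log-likelihood of -120 with 3 parameters, n = 100
print(aic(-120.0, 3))       # penalty of 2 per parameter
print(bic(-120.0, 3, 100))  # penalty of ln(100) ~ 4.6 per parameter
```

The only difference is the per-parameter penalty: 2 for AIC versus ln(n) for BIC, which is what drives everything discussed below.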
On the other hand, B&A make a compelling argument that BIC was developed to identify the "dimension" or "true number of parameters" of a model, and that this is rarely sensible in ecological modeling contexts because of what B&A call tapering effects. That is, if you have a large number of predictor variables some of which have non-zero effect sizes (e.g. regression coefficients) and some of which have zero coefficients, the BIC is trying to tell you how many are non-zero. B&A point out that it is more common that there will really be a few coefficients with large magnitude, more with smaller magnitude, even more with tiny magnitudes … and that all of the predictor variables really have some effect on the response, albeit very small. What we should be trying to do, they say, is identify how many parameters are useful for prediction rather than how many are non-zero. (This also agrees in general with the Bayesian argument against point null hypotheses, i.e. that parameters are never exactly zero — somewhat ironic, since it suggests that Bayesians would prefer AIC over the "Bayesian" Information Criterion.)
The bottom line: I would say the AIC is generally the right choice for ecological questions, over BIC, unless you're really trying to identify a specific number of components. (People do this in time-series analysis, to try to identify the number of time lags or interacting species, although I think they probably shouldn't — the "tapering effects" argument really applies here.)
The argument against the BIC above certainly makes sense, but I don't know of any theory that tells us that the AIC is any better in that respect. It tries to find the number of nonzero coefficients as well, though in a more liberal and upwardly biased way. I don't know whether that works well as a justification for using the AIC.
If you want a theory that, instead of estimating an unknown and essentially unobservable underlying truth, gives you what you find "interesting" or "useful", you have to formalise your concept of "usefulness" as a new criterion. In particular, don't expect to get anything objective or standard from it, because usefulness generally is not objective.
Christian Hennig
The BIC is consistent and was designed to identify the "true" dimensionality of an underlying model. The AIC is not consistent but has lower error: "if the number of models of the same dimension does not grow very fast in dimension, the average squared error of the model selected by AIC is asymptotically equivalent to the minimum offered by the candidate models … There has been a debate between AIC and BIC in the literature, centring on the issue of whether the true model is finite-dimensional or infinite-dimensional. There seems to be a consensus that, for the former case, BIC should be preferred, and AIC should be chosen for the latter" (Yang 2005). Furthermore, Yang 2005 shows (apparently: I haven't tried to follow the technical details) that you can't have your cake and eat it too — you have to make a decision between consistency and minimizing prediction errors.
Given that consensus, I would say it usually makes more sense to think of infinite-dimensional models (or at least models of much higher dimension than the most complex of the models we try to fit) as the default case for ecology, and therefore to prefer AIC.
Yang, Yuhong. 2005. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92(4): 937-950. doi:10.1093/biomet/92.4.937.
I've been working with structural equations a lot recently, and have quite often noted large discrepancies between the AIC and BIC values. Interestingly, the BIC often selects models that actually do not fit the data, but happen to be simpler - particularly when the effect sizes of many coefficients are small. It's made me wary of its usage. The tapering effects argument makes a lot of sense, particularly when applied to ecological data.
I find it a bit difficult to see how BIC could find a greater number of "tiny" effects and AIC would not also find these "tiny" effects (and more). If the sample size is greater than 7 (which it always is in practice - who would use a "non-informative" measure like AIC or BIC with 7 or fewer data points?), then BIC will always favor a simpler model compared to AIC. BIC can never find a "tiny" effect which AIC does not also find, if the search is done over the same set of models. That is, if you calculate AIC and BIC for a fixed set of models, AIC can never prefer a model of lower dimensionality than BIC unless the sample size is below 8, simply because the dimension penalty is the only difference between how they rank models.
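The arithmetic behind the "sample size greater than 7" claim is that BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2 exactly when n > e^2 ≈ 7.39, i.e. n ≥ 8. A quick sketch (the sample size and log-likelihood gains are made up) showing that BIC can only pick the bigger of two nested models when AIC does too:

```python
import math

def bigger_model_wins(delta_loglik, delta_k, per_param_penalty):
    """The bigger model wins under an information criterion when twice the
    log-likelihood gain exceeds the penalty for the extra parameters."""
    return 2 * delta_loglik > per_param_penalty * delta_k

n = 50  # hypothetical sample size; ln(50) ~ 3.91 > 2
for gain in [0.5, 1.5, 5.0, 20.0]:  # hypothetical log-likelihood gains
    aic_bigger = bigger_model_wins(gain, 2, 2.0)          # AIC penalty per parameter
    bic_bigger = bigger_model_wins(gain, 2, math.log(n))  # BIC penalty per parameter
    # For n >= 8, whenever BIC prefers the bigger model, AIC must too,
    # never the other way around.
    assert (not bic_bigger) or aic_bigger
```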
However, if you use AIC in a "stepwise", "forward", or any other path-dependent model selection routine, then your final model may differ from that of a BIC "stepwise" approach. I think this says more about stepwise and other path-dependent model selection routines than it does about AIC or BIC. If you do a "stepwise" AIC, then getting BIC for the same sequence of models searched requires only a very simple adjustment of the AIC values (and similarly for BIC stepwise).
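The "very simple adjustment" follows from the definitions: since AIC = 2k - 2 logL and BIC = k ln(n) - 2 logL, we have BIC = AIC + k(ln(n) - 2) for each model. A sketch, with a made-up stepwise path of AIC values and parameter counts:

```python
import math

def aic_to_bic(aics, ks, n):
    """Convert AIC values to BIC values for the same models.
    The criteria differ only in the penalty term, so
    BIC = AIC + k * (ln(n) - 2)."""
    return [a + k * (math.log(n) - 2) for a, k in zip(aics, ks)]

# Hypothetical forward-selection path: AIC and parameter count at each step
aics = [310.2, 305.7, 304.9, 305.5]
ks = [2, 3, 4, 5]
bics = aic_to_bic(aics, ks, n=100)
```

With these made-up numbers, AIC's minimum falls on the 4-parameter model while BIC's falls on the 3-parameter one, illustrating how the two criteria can disagree along the same search path even though the models visited are identical.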
One other thing to note is that neither BIC nor AIC are particularly good when you have prior information about parameters within a model - such as knowing that variable X has a large effect, variable Y has a small effect, and variable Z has a tiny effect, and so on. Both methods basically assume that you don't know anything about the effect sizes and require them to be estimated from the data (if they did then there would be some place where you could put in that information, and there isn't).
Additionally, they are both approximate tools, so they will necessarily break down in particular problems. For BIC this happens when you run into "identifiability" problems, when MLEs of parameters lie on or close to boundaries of the parameter space, or when the likelihood is multi-modal (all of which could quite easily happen in structural equation modeling because of the many latent variables, which perhaps explains the behavior noted above). These conditions make the Laplace integral approximation perform poorly, and so BIC will also perform poorly there.
Sorry, didn't see your post for a while.