Akaike Information Criterion
#Bayesian #Model Selection
Suppose we have a model that describes the data generation process behind a dataset. The distribution given by the model is denoted $\hat f$. The actual data generation process is described by a distribution $f$.
We ask the question:
How good is the approximation using $\hat f$?
To be more precise, how much information is lost if we use our model distribution $\hat f$ in place of the actual data generation distribution $f$?
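One standard way to make "information lost" precise is the Kullback-Leibler divergence from $f$ to $\hat f$. As a sketch, with $x$ denoting a generic observation (a symbol introduced here for illustration, not part of the original notation):

$$ D_{\mathrm{KL}}(f \,\|\, \hat f) = \int f(x) \ln \frac{f(x)}{\hat f(x)} \, \mathrm{d}x $$

It is zero when $\hat f = f$ and grows as the approximation gets worse. Since $f$ is unknown, this quantity cannot be evaluated directly, so a criterion that can be computed from the data is needed.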
AIC defines this information loss as
$$ \mathrm{AIC} = - 2 \ln p(y \mid \hat\theta) + 2k $$
 $y$: data set
 $\hat\theta$: parameters of the model, estimated by maximum likelihood
 $\ln p(y \mid \hat\theta)$: maximized log-likelihood (the goodness of fit)
 $k$: number of adjustable model parameters; $+2k$ is then a penalty.
The first term represents the goodness of fit and the second term is a penalty for the complexity.
The smaller the AIC, the better the model is according to this criterion.
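The comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the data are synthetic, both candidate models are Gaussian, and the helper names `aic` and `gaussian_log_likelihood` are hypothetical. Model A fixes the mean at 0 and estimates only the variance ($k=1$); model B estimates both the mean and the variance ($k=2$).

```python
import math
import random


def aic(log_likelihood, k):
    """AIC = -2 ln p(y | theta_hat) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def gaussian_log_likelihood(y, mu, sigma2):
    """Log-likelihood of the data y under a Normal(mu, sigma2) model."""
    n = len(y)
    rss = sum((v - mu) ** 2 for v in y)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - rss / (2.0 * sigma2)


# Synthetic data (an assumption for illustration): true mean 2, true std 1.
random.seed(0)
y = [random.gauss(2.0, 1.0) for _ in range(100)]
n = len(y)

# Model A: mean fixed at 0; only the variance is estimated (k = 1).
sigma2_a = sum(v ** 2 for v in y) / n  # maximum likelihood estimate
aic_a = aic(gaussian_log_likelihood(y, 0.0, sigma2_a), k=1)

# Model B: mean and variance are both estimated (k = 2).
mu_b = sum(y) / n
sigma2_b = sum((v - mu_b) ** 2 for v in y) / n  # maximum likelihood estimate
aic_b = aic(gaussian_log_likelihood(y, mu_b, sigma2_b), k=2)

print(f"AIC of model A (k=1): {aic_a:.2f}")
print(f"AIC of model B (k=2): {aic_b:.2f}")
print("Preferred model:", "A" if aic_a < aic_b else "B")
```

Model B pays a larger penalty for its extra parameter, but its much better fit to data whose mean is far from 0 more than compensates, so it ends up with the smaller AIC.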
Limiting behaviors:
 $k\to0$: $\mathrm{AIC}\to - 2 \ln p(y \mid \hat\theta)$, which makes sense since we estimated the parameters using maximum likelihood.
 $k\to\infty$: $\mathrm{AIC}\to\infty$. There is a problem with this. If we have a huge number of adjustable parameters, the penalty term dominates and the data set is no longer relevant for choosing a model.
L Ma (2020). 'Akaike Information Criterion', Datumorphism, 11 April. Available at: https://datumorphism.leima.is/cards/statistics/aic/.