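Stated more precisely, the maximum likelihood estimate of a model's parameters is

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \; p(X \mid \theta)$$

where X is the observed data and θ represents the model parameters. The MAP estimate of the model parameters is

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(\theta \mid X) = \arg\max_{\theta} \; p(X \mid \theta)\, p(\theta)$$

where p(θ) is the prior distribution over the parameters.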
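As a rough illustration of how the two estimates differ, the sketch below fits a coin's heads probability by searching a grid of candidate values. The coin-flip model, the data (7 heads, 3 tails), the Beta(5, 5) prior and the helper names are all assumptions made for this example. They are not part of the discussion above.

```python
import math

def log_likelihood(theta, heads, tails):
    """Log of p(X | theta) for independent coin flips (Bernoulli model)."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def log_prior(theta, a, b):
    """Log of an unnormalized Beta(a, b) prior p(theta); an assumed choice."""
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)

def argmax_on_grid(score, steps=9999):
    """Return the grid point strictly inside (0, 1) that maximizes score."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=score)

heads, tails = 7, 3   # hypothetical observed data X
a, b = 5.0, 5.0       # hypothetical prior pseudo-counts

# ML: maximize the likelihood alone.
theta_ml = argmax_on_grid(lambda t: log_likelihood(t, heads, tails))

# MAP: maximize likelihood times prior (a sum in log space).
theta_map = argmax_on_grid(
    lambda t: log_likelihood(t, heads, tails) + log_prior(t, a, b)
)

print(f"ML estimate:  {theta_ml:.3f}")
print(f"MAP estimate: {theta_map:.3f}")
```

For this particular model the closed forms are 7/10 for the ML estimate and 11/18 for the MAP estimate, so the grid search should land close to those values. The prior pulls the MAP estimate toward 1/2.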
The relationship between compression, truth and maximum likelihood can be seen by bounding the expected log-likelihood.
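If p(x) is the distribution of a random variable X that takes on values x from the real numbers and q(x) is some other distribution, then by the definition of a distribution, we know that

$$\int p(x)\,dx = \int q(x)\,dx = 1$$

If we take N repeated independent samples xi of X, then the expected value of the mean of log q(xi) is given by

$$E\left[\frac{1}{N}\sum_{i=1}^{N} \log q(x_i)\right] = \int p(x) \log q(x)\,dx$$

But since log y ≤ y − 1 for positive y, taking y = q(x)/p(x) we have

$$\int p(x) \log\frac{q(x)}{p(x)}\,dx \le \int p(x)\left(\frac{q(x)}{p(x)} - 1\right)dx = \int q(x)\,dx - \int p(x)\,dx = 0$$

Thus,

$$\int p(x) \log q(x)\,dx \le \int p(x) \log p(x)\,dx$$

and equality can only be achieved where q(x) = p(x) (almost everywhere).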
This means that maximizing the expected value of the mean value of log q is the same as finding p. To the extent that the law of large numbers lets us approximate this expected value by the observed mean, maximizing this observed mean lets us approximate p.
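The claim can be checked numerically with a small sketch. Here the true distribution p is assumed to be a standard normal, and the candidates q are unit-variance normals with different means. Both choices, along with the sample size and the seed, are assumptions made only for this illustration.

```python
import math
import random

def log_normal_pdf(x, mu, sigma=1.0):
    """Log density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def mean_log_q(samples, mu):
    """Observed mean of log q(xi) when q is normal with mean mu, std 1."""
    return sum(log_normal_pdf(x, mu) for x in samples) / len(samples)

random.seed(0)  # fixed seed so the run is repeatable
true_mu = 0.0   # p is assumed to be a standard normal
samples = [random.gauss(true_mu, 1.0) for _ in range(20000)]

# Score several candidate models q against the same samples.
candidate_mus = [-1.0, -0.5, 0.0, 0.5, 1.0]
scores = {mu: mean_log_q(samples, mu) for mu in candidate_mus}
best_mu = max(scores, key=scores.get)

for mu in candidate_mus:
    print(f"mu = {mu:+.1f}  mean log q = {scores[mu]:.4f}")
print("best candidate:", best_mu)
```

With this many samples the candidate whose mean matches that of p should score highest, and its score should sit near the integral of p log p, which for a standard normal is about −1.419.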
Thus it can be said that statistical inference lets us approximate the truth.
Interestingly, the negative of this mean value of log q is the expected length per sample (in bits, if the logarithm is taken base 2) of a compressed representation of the xi where q is the model used to do the compression. Thus we can also claim that ultimate compression = truth. This leads us off to Occam's Razor.
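The link to compression can be sketched the same way for a discrete source. The four-symbol alphabet and its probabilities below are assumptions chosen for this example, and the code length is the ideal one, −log2 q(x) bits per symbol, rather than the output of any actual compressor.

```python
import math
import random

# Hypothetical true source distribution p over a four-symbol alphabet.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# A mismatched model q that treats every symbol as equally likely.
q_uniform = {symbol: 0.25 for symbol in p}

def bits_per_symbol(samples, q):
    """Negative mean of log2 q(xi): ideal code length per sample under q."""
    return -sum(math.log2(q[x]) for x in samples) / len(samples)

random.seed(0)  # fixed seed so the run is repeatable
symbols = list(p)
weights = [p[s] for s in symbols]
samples = random.choices(symbols, weights=weights, k=20000)

bits_true = bits_per_symbol(samples, p)             # model equals the truth
bits_uniform = bits_per_symbol(samples, q_uniform)  # model ignores the truth

print(f"bits per symbol using p:       {bits_true:.3f}")
print(f"bits per symbol using uniform: {bits_uniform:.3f}")
```

The uniform model spends 2 bits on every symbol, while the model that matches the source should come out near its entropy of 1.75 bits per symbol. The closer q is to p, the shorter the encoding.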