Why your model parameter confidences might be too optimistic -- unbiased estimation of the inverse covariance matrix
J. Hartlap, P. Simon, P. Schneider
TL;DR
The paper analyzes biases in estimating the inverse covariance matrix and the resulting impact on likelihood-based parameter inference. It proves that the standard maximum-likelihood covariance estimator is singular when the data-vector dimension $p$ exceeds the number of realizations $n$ (or $n-1$ if the mean is estimated from the data), and that its inverse, where it exists, is biased. For $p < n-2$ the bias can be removed by the corrected estimator $\bigl[\mathsf{C}^{-1}\bigr] = \frac{n-p-2}{n-1}\,\bigl[\mathsf{C}^{-1}_{*}\bigr]$, valid under the assumptions of Gaussian noise and statistically independent data vectors. Monte-Carlo experiments illustrate how the bias scales with $p/n$ and show that naive inversion or SVD-based pseudo-inverses can misestimate confidence regions, especially for $p$ close to $n$. Marginalisation and measures of confidence-region size can inherit bias despite the corrected inverse, with up to ~8% bias in the marginalised likelihood and up to ~30% in Fisher-based region sizes as $p/n \to 1$. Bootstrapping with Gaussian-like noise remains robust, but non-Gaussian statistics (e.g., log-normal noise) break the unbiased estimator’s reliability, underscoring the need for structure-aware covariance estimators in realistic, non-ideal data analyses. Overall, the work provides practical guidance: avoid using more bins than realizations, use the unbiased inverse when Gaussian independence holds, and develop improved estimators for non-Gaussian or correlated data.
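The bias and its correction are easy to reproduce numerically. Below is a minimal NumPy sketch (not code from the paper) that averages the inverse of the sample covariance over many Monte-Carlo trials for a unit true covariance, recovering the bias factor $(n-1)/(n-p-2)$ and showing that multiplying by $(n-p-2)/(n-1)$ removes it; all parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 10, 30, 2000   # illustrative: p bins, n realizations per trial
true_cov = np.eye(p)          # true covariance is the identity

# Average the inverse of the sample covariance over many trials.
mean_inv = np.zeros((p, p))
for _ in range(trials):
    data = rng.standard_normal((n, p))        # n independent Gaussian data vectors
    sample_cov = np.cov(data, rowvar=False)   # mean estimated, divides by n-1
    mean_inv += np.linalg.inv(sample_cov)
mean_inv /= trials

# Diagonal of E[C*^{-1}] relative to the true inverse (identity):
naive_bias = np.trace(mean_inv) / p           # close to (n-1)/(n-p-2) ~ 1.61
hartlap = (n - p - 2) / (n - 1)               # correction factor from the paper
corrected = np.trace(hartlap * mean_inv) / p  # close to 1.0
print(naive_bias, corrected)
```

Note that the naive inverse overestimates the precision matrix, which directly shrinks the inferred confidence regions; the correction factor depends only on $p$ and $n$, not on the data.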
Abstract
AIMS. The maximum-likelihood method is the standard approach to obtaining model fits to observational data and the corresponding confidence regions. We investigate possible sources of bias in the log-likelihood function and its subsequent analysis, focusing on estimators of the inverse covariance matrix. Furthermore, we study under which circumstances the estimated covariance matrix is invertible.

METHODS. We perform Monte-Carlo simulations to investigate the behaviour of estimators for the inverse covariance matrix, depending on the number of independent data sets and the number of variables in the data vectors.

RESULTS. We find that the inverse of the maximum-likelihood estimator of the covariance is biased, with the amount of bias depending on the ratio of the number of bins (data-vector variables), P, to the number of data sets, N. This bias inevitably leads to an underestimation of the size of confidence regions, catastrophic in extreme cases. We report on a method to remove this bias for the idealised case of Gaussian noise and statistically independent data vectors. Moreover, we demonstrate that marginalisation over parameters introduces a bias into the marginalised log-likelihood function. Measures of the sizes of confidence regions suffer from the same problem. Furthermore, we give an analytic proof for the fact that the estimated covariance matrix is singular if P > N.
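The singularity result is also straightforward to check numerically: with the mean estimated from the data, the sample covariance is a sum of N rank-one terms subject to one linear constraint, so its rank is at most N-1 and it cannot be inverted when P > N-1. A minimal NumPy sketch (illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 8, 5  # more bins than realizations
data = rng.standard_normal((n, p))            # n Gaussian data vectors of length p
sample_cov = np.cov(data, rowvar=False)       # mean estimated from the data

# Rank is at most n-1 < p, so the matrix is singular and np.linalg.inv would fail.
rank = np.linalg.matrix_rank(sample_cov)
print(rank)  # n - 1 = 4 (almost surely)
```

This is why SVD-based pseudo-inverses are sometimes used in this regime; as the TL;DR notes, they can still misestimate confidence regions.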
