Why your model parameter confidences might be too optimistic -- unbiased estimation of the inverse covariance matrix

J. Hartlap, P. Simon, P. Schneider

TL;DR

The paper analyzes biases in estimating the inverse covariance matrix and the resulting impact on likelihood-based parameter inference. It proves that the standard ML covariance estimator is singular when the data-vector dimension $p$ exceeds the number of realizations $n$ (or $n-1$ if the mean is estimated from the data) and that its inverse is biased for $p<n-2$, providing a corrected unbiased estimator $\hat{\tens{C}}^{-1} = \frac{n-p-2}{n-1}\,\tens{C}^{-1}_*$ under Gaussian, independent assumptions. Monte-Carlo experiments illustrate how the bias scales with $p/n$ and show that naive inversion or SVD-based pseudo-inverses can misestimate confidence regions, especially for $p$ close to $n$. Marginalisation and measures of confidence-region size can inherit bias despite the corrected inverse, with up to ~8% bias in marginalised likelihood and up to ~30% in Fisher-based region sizes as $p/n \to 1$. Bootstrapping with Gaussian-like noise remains robust, but non-Gaussian statistics (e.g., log-normal noise) break the unbiased estimator’s reliability, underscoring the need for structure-aware covariance estimators in realistic, non-ideal data analyses. Overall, the work provides practical guidance: avoid more bins than realizations, use the unbiased inverse when Gaussian independence holds, and develop improved estimators for non-Gaussian or correlated data.
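For concreteness, here is a minimal sketch of applying that correction in Python/NumPy (my illustration; the function name `unbiased_precision` and the input layout are assumptions, not the paper's code):

```python
import numpy as np

def unbiased_precision(realizations):
    """Debiased inverse covariance using the correction factor
    (n - p - 2)/(n - 1) of Hartlap et al.

    realizations : (n, p) array of n independent, Gaussian data vectors
                   with p bins each (layout is an assumption of this sketch).
    """
    n, p = realizations.shape
    if n - p - 2 <= 0:
        # The ML estimate is singular for p > n - 1, and the corrected
        # inverse is only defined for p < n - 2.
        raise ValueError("need p < n - 2 for an unbiased inverse")
    # Standard ML covariance estimate (mean estimated from the data);
    # np.cov uses the 1/(n-1) normalisation by default.
    cov_ml = np.cov(realizations, rowvar=False)
    # Naive inversion is biased high by a factor (n-1)/(n-p-2);
    # multiplying by its reciprocal removes the bias.
    return (n - p - 2) / (n - 1) * np.linalg.inv(cov_ml)
```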

Abstract

AIMS. The maximum-likelihood method is the standard approach to obtain model fits to observational data and the corresponding confidence regions. We investigate possible sources of bias in the log-likelihood function and its subsequent analysis, focusing on estimators of the inverse covariance matrix. Furthermore, we study under which circumstances the estimated covariance matrix is invertible. METHODS. We perform Monte-Carlo simulations to investigate the behaviour of estimators for the inverse covariance matrix, depending on the number of independent data sets and the number of variables of the data vectors. RESULTS. We find that the inverse of the maximum-likelihood estimator of the covariance is biased, the amount of bias depending on the ratio of the number of bins (data vector variables), $p$, to the number of data sets, $n$. This bias inevitably leads to an -- in extreme cases catastrophic -- underestimation of the size of confidence regions. We report on a method to remove this bias for the idealised case of Gaussian noise and statistically independent data vectors. Moreover, we demonstrate that marginalisation over parameters introduces a bias into the marginalised log-likelihood function. Measures of the sizes of confidence regions suffer from the same problem. Furthermore, we give an analytic proof for the fact that the estimated covariance matrix is singular if $p>n$.
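The scale of this bias is easy to reproduce with a small Monte-Carlo experiment in the spirit of the paper's simulations (a sketch under the idealised Gaussian, independent assumptions above, not the authors' code): draw $n$ Gaussian data vectors with $p$ bins from an identity covariance $\tens{\Sigma}$ and compare the average trace of the inverted ML estimate with $\mathrm{tr}\,\tens{\Sigma}^{-1}=p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 60, 20, 2000        # realizations, bins, Monte-Carlo repeats

ratios = []
for _ in range(trials):
    data = rng.standard_normal((n, p))                # n Gaussian data vectors
    prec = np.linalg.inv(np.cov(data, rowvar=False))  # naive inverse of ML estimate
    ratios.append(np.trace(prec) / p)                 # tr(Sigma^{-1}) = p here

print(f"measured bias factor   : {np.mean(ratios):.3f}")
print(f"predicted (n-1)/(n-p-2): {(n - 1) / (n - p - 2):.3f}")
```

For $n=60$, $p=20$ the predicted factor is $59/38\approx1.55$: even at $p/n=1/3$, naive inversion overstates the precision matrix by roughly 55%, which is what drives the underestimated confidence regions.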

Paper Structure

This paper contains 12 sections, 21 equations, and 3 figures.

Figures (3)

  • Figure 1: Ratios of the trace of $\tens{\Sigma}^{-1}$ to the traces of $\tens{C}^{-1}_*$ (triangles) and $\hat{\tens{C}}^{-1}$ (squares), respectively. The dashed line is for the covariance model $\tens{\Sigma}^{\rm d,c}$, the solid line for $\tens{\Sigma}^{\rm d,l}$ and the dot-dashed line for $\tens{\Sigma}^{\rm nd}$. The original data vectors had $p_1=240$ bins, and were rebinned by subsequently joining $2,\, 3,\, \ldots$ of the original bins. The number of independent observations is $n=60$. Error bars are comparable to the symbol size and therefore omitted.
  • Figure 2: Triangles, solid lines: Ratio of the sum over all pixels of the marginalised likelihood computed using $\hat{\tens{C}}^{-1}$ and the true marginalised likelihood. Filled triangles are for the power-law fit (marginalised over the power-law index), open triangles are for the straight line fit (marginalised over the intercept). Squares, dashed lines: Ratio of $\sqrt{\det\tens{F}^{-1}}$ using $\hat{\tens{C}}^{-1}$ to the true one, computed with $\tens{\Sigma}$. For both cases $\tens{\Sigma}=\tens{\Sigma}^{\rm d,c}$.
  • Figure 3: Ratio of the traces of the unbiased estimator $\hat{\tens{C}}^{-1}_*$ and $\hat{\tens{C}}^{-1}$ to the trace of $\tens{\Sigma}^{-1}$; the covariances for the solid curve have been estimated using bootstrapping (see text), the dashed line shows the ratio of the traces for log-normal errors.
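The solid curve in Figure 3 uses covariance matrices estimated by bootstrapping. As a generic sketch of bootstrap covariance estimation (my illustration, with the statistic taken to be the mean data vector; the helper name `bootstrap_covariance` is hypothetical and not necessarily the paper's exact resampling scheme):

```python
import numpy as np

def bootstrap_covariance(samples, n_boot=1000, rng=None):
    """Bootstrap covariance of the mean data vector.

    samples : (n, p) array of n independent realizations with p bins.
    Resamples the n realizations with replacement, recomputes the mean
    for each resample, and returns the covariance of those means.
    """
    rng = rng or np.random.default_rng()
    n, p = samples.shape
    means = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n indices with replacement
        means[b] = samples[idx].mean(axis=0)
    return np.cov(means, rowvar=False)
```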