Robust model selection using likelihood as data

Jongwoo Choi, Neil A. Spencer, Jeffrey W. Miller

TL;DR

The paper introduces an approach to model selection that treats the likelihood values themselves as data: a multivariate normal model is used to estimate, and quantify uncertainty in, the expected negative log-likelihood of each candidate model, providing calibrated inferences for robust model selection under misspecification.

Abstract

Model selection is a central task in statistics, but standard methods are not robust in misspecified settings where the true data-generating process (DGP) is not in the set of candidate models. The key limitation is that existing methods -- including information criteria and Bayesian posteriors -- do not quantify uncertainty about how well each candidate model approximates the true DGP. In this paper, we introduce a novel approach to model selection based on modeling the likelihood values themselves. Specifically, given $K$ candidate models and $n$ observations, we view the $n\times K$ matrix of negative log-likelihood values as a random data matrix and observe that the expectation of each row is equal to the vector of Kullback--Leibler divergences between the $K$ models and the true DGP, up to an additive constant. We use a multivariate normal model to estimate and quantify uncertainty in this expectation, providing calibrated inferences for robust model selection under misspecification. The procedure is easy to compute, interpretable, and comes with theoretical guarantees, including consistency.
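
The observation at the heart of the abstract is a one-line identity. As a sketch in notation that is partly assumed here ($p_0$ for the density of the true DGP, $f_k$ for the density of candidate model $k$, and $Z_{ik} = -\log f_k(X_i)$ for the entries of the $n \times K$ matrix):

$$\mathbb{E}[Z_{ik}] = -\int p_0(x) \log f_k(x)\,dx = \mathrm{KL}(p_0 \,\|\, f_k) + H(p_0), \qquad H(p_0) = -\int p_0(x) \log p_0(x)\,dx.$$

The entropy term $H(p_0)$ does not depend on $k$, so ranking models by the mean of their column ranks them by KL divergence to the true DGP. The Python sketch below illustrates the idea on a toy problem. The data distribution, the three fixed candidate densities, and the plug-in normal approximation $\mathcal{N}(\hat{\mu}, \hat{\Sigma}/n)$ are all assumptions made for illustration; the paper itself works with an NIW posterior, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: data come from a Student-t(5), so every
# candidate below is misspecified.
n = 2000
x = rng.standard_t(df=5, size=n)

# Hypothetical candidate models: fixed densities with no free parameters.
def nll_gaussian(x):
    return 0.5 * np.log(2 * np.pi) + 0.5 * x**2   # -log N(x; 0, 1)

def nll_laplace(x):
    return np.log(2.0) + np.abs(x)                # -log Laplace(x; 0, 1)

def nll_cauchy(x):
    return np.log(np.pi) + np.log1p(x**2)         # -log Cauchy(x; 0, 1)

candidates = [nll_gaussian, nll_laplace, nll_cauchy]
K = len(candidates)

# n x K matrix of negative log-likelihood values, treated as data.
Z = np.column_stack([f(x) for f in candidates])

# Column means estimate KL(p0 || f_k) up to the shared constant H(p0).
mu_hat = Z.mean(axis=0)
Sigma_hat = np.cov(Z, rowvar=False)

# Estimated KL gap of each model relative to the best one.
gap_hat = mu_hat - mu_hat.min()

# Plug-in normal approximation for the uncertainty in the mean vector
# (a stand-in for the paper's NIW posterior, not the same thing).
draws = rng.multivariate_normal(mu_hat, Sigma_hat / n, size=5000)

# Fraction of draws in which each model has the smallest mean.
p_best = np.bincount(draws.argmin(axis=1), minlength=K) / draws.shape[0]

print("estimated gaps:", gap_hat)
print("P(model k is best):", p_best)
```

Because the columns of `Z` are computed on the same observations, they are correlated, which is why the sketch keeps the full covariance `Sigma_hat` rather than treating each model's mean separately.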

Paper Structure

This paper contains 50 sections, 21 theorems, 198 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.1

Assume Conditions cond1 and cond2, and suppose $\delta \neq \mu^0_k - \mu^0_{\mathrm{min}}$ for all $k = 1,\ldots,K$. Let $\alpha_n \geq 0$ such that $\alpha_n \to \infty$ with $\alpha_n = o(\sqrt{n})$. Define $w_{\delta}(k \mid Z_{1:n})$ as in eq:pkn2 based on the NIW posterior in sec:method:bayesi

Figures (8)

  • Figure 1: Limitations of existing model selection methods. (Left) Histogram of the Shapley galaxy velocities ($\times 10^3$ km/s). We trim the top $0.5\%$ of data points to remove the extremely long right tail extending beyond $45{,}000$ km/s. (Middle) Boxplots summarizing draws from the Bayesian posterior on $k$ for a Gaussian mixture model with $k$ components, for sample sizes $n \in \{40, 120, 400, 1200, 4000\}$. As $n$ grows, the posterior shifts toward larger $k$, a typical phenomenon under misspecification. (Right) Values of $k$ selected by AIC (red) and BIC (blue) for each $n$. These results illustrate lack of robustness and inability to quantify uncertainty in how well a $k$-component Gaussian mixture can approximate the true DGP.
  • Figure 2: Instability under ties with $K=3$. The figure shows 1,000 posterior draws from $\mu \sim \mathcal{N}(\mu^0, \Sigma^0/n)$ with $n=500$, $\mu^0 = (0,0,0)^\mathtt{T}$, and $\Sigma^0$ as specified in the text. The left panel shows the strong negative correlation between $\mu_1$ and $\mu_2$, the middle panel shows that $\mu_3$ is much more concentrated than $\mu_1$ and $\mu_2$, and the right panel makes clear that $\mu_3$ is rarely the minimum.
  • Figure 3: Gaussian mixture model results on the Shapley galaxy dataset. (Top row) Usual Bayesian posterior on $k$ (bars) with AIC (red dashed) and BIC (blue dot-dash) overlaid; as $n$ increases, all three trend toward larger $k$. (Second row) LaD posteriors on $\mu_k$ for each $k$ show the fit to the true DGP as a function of $k$, and concentrate as $n$ grows. (Third row) SLC score $\hat{w}_{\delta}(k)$ for tolerances $\delta \in \{0.3, 0.12, 0.06\}$. Larger $\delta$ selects simpler models; increasing $n$ concentrates the scores onto a single $k$ for each $\delta$. (Bottom row) Heatmap of the "posterior path" of the SLC score $\hat{w}_{\delta}(k)$ shows the transition from selecting more complex to simpler models as a function of the rescaled tolerance $\hat{\tau} = \delta/(\hat{\mu}_{\mathrm{noise}} - \hat{\mu}_{\mathrm{min}}) \in [0,1]$.
  • Figure 4: Sparse multivariate normal model results. (Top) SLC scores $\hat{w}_{\delta}(k)$ for sample sizes $n \in \{50, 500, 5000\}$ (columns) and tolerances $\delta \in \{0.75, 0.26, 0.05\}$ (rows). (Bottom) Boxplots of the LaD posteriors on $\mu_k$ for $k \in \{1,\ldots,7\}$.
  • Figure 5: Performance comparisons using Brier score. Points show the mean Brier loss, and lines show $\pm$ standard error across $50$ datasets for sample sizes $n \in \{50, 500, 5000\}$ (x-axis; vertical dashed lines separate values of $n$). Each panel corresponds to tolerance settings: left $\delta=0.75$, middle $\delta = 0.26$, and right $\delta = 0.05$. Colors indicate the competing methods.
  • ...and 3 more figures
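
The tie scenario described in the Figure 2 caption can be imitated with a few lines of Python. This is only a sketch: the caption says $\Sigma^0$ is "as specified in the text", and that matrix is not reproduced here, so the `Sigma0` below is a made-up covariance chosen to have the qualitative features the caption mentions (strong negative correlation between the first two components, and a much smaller variance for the third).

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500
mu0 = np.zeros(3)  # all three models are exactly tied, as in the caption

# HYPOTHETICAL covariance, not the paper's Sigma^0: components 1 and 2
# are strongly negatively correlated, component 3 has small variance.
Sigma0 = np.array([
    [1.0, -0.9, 0.0],
    [-0.9, 1.0, 0.0],
    [0.0, 0.0, 0.01],
])

# 1,000 draws from N(mu0, Sigma0 / n), matching the caption's setup.
draws = rng.multivariate_normal(mu0, Sigma0 / n, size=1000)

# How often each component is the smallest of the three.
freq_min = np.bincount(draws.argmin(axis=1), minlength=3) / draws.shape[0]

# Empirical correlation between the first two components.
corr_12 = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]

print("fraction of draws where each mu_k is the minimum:", freq_min)
print("corr(mu_1, mu_2):", corr_12)
```

Under this assumed covariance, the third component should come out as the minimum in only a small fraction of draws even though all three means are equal: when $\mu_1$ and $\mu_2$ move in opposite directions, one of them almost always falls below the tightly concentrated $\mu_3$.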

Theorems & Definitions (44)

  • Definition 1
  • Definition 2
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Corollary 4.5
  • Lemma S2.2
  • Proof of \ref{supp:lemma:high-prob-bound}
  • Theorem S2.3
  • ...and 34 more