Are you using test log-likelihood correctly?

Sameer K. Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick

TL;DR

The paper questions the routine use of test log-likelihood (TLL) as a universal metric for comparing probabilistic models and approximate posteriors. It argues that TLL estimates the expected log predictive density (elpd), which reflects how close the approximate posterior predictive is to the data-generating distribution. TLL need not track how well the approximate posterior matches the exact posterior, or how well the approximate posterior predictive matches the exact posterior predictive, especially under misspecification or with finite data. Through a collection of misspecified and well-specified examples, the authors show that higher TLL can coincide with poorer posterior approximations and can even conflict with RMSE-based predictive performance. They advocate letting explicit goals guide evaluation: they suggest complementary metrics and diagnostic tools beyond TLL for assessing posterior or predictive accuracy, and they offer practical guidance on when to rely on TLL and how to interpret its results.
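
For reference, the two quantities named in the summary can be written out as follows. This is a sketch in assumed notation: $N^{\star}$ (the number of test observations) and $H(\mathcal{P})$ (the differential entropy of the data-generating distribution, assumed finite) are symbols introduced here, and the $1/N^{\star}$ normalization is an assumption, since some authors report a sum rather than an average.

$$
\textrm{TLL}(\mathcal{D}^{\star};\hat{\Pi}) = \frac{1}{N^{\star}} \sum_{n=1}^{N^{\star}} \log \hat{\Pi}(y_n^{\star} \vert \mathcal{D}),
\qquad
\textrm{elpd}(\hat{\Pi}) = \mathbb{E}_{y^{\star} \sim \mathcal{P}}\left[\log \hat{\Pi}(y^{\star} \vert \mathcal{D})\right] = -\textrm{KL}\left(\mathcal{P} \,\Vert\, \hat{\Pi}(\cdot \vert \mathcal{D})\right) - H(\mathcal{P})
$$

Because $H(\mathcal{P})$ does not depend on $\hat{\Pi}$, ranking methods by elpd is equivalent to ranking them by the KL divergence from the data-generating distribution to the approximate posterior predictive. Neither the exact posterior nor the exact posterior predictive appears in the expression.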

Abstract

Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
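
To make point (ii) concrete, here is a minimal sketch of how TLL and RMSE can rank two forecasts differently. It is a toy illustration and not one of the paper's examples: the standard normal data, the two Gaussian forecasts, and the helper names `gaussian_logpdf`, `tll_with_two_se`, and `rmse` are all assumptions made for this sketch.

```python
import numpy as np


def gaussian_logpdf(y, mean, sd):
    """Log density of N(mean, sd^2) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((y - mean) / sd) ** 2


def tll_with_two_se(y_test, mean, sd):
    """Average test log-likelihood and two standard errors of that average."""
    log_preds = gaussian_logpdf(y_test, mean, sd)
    two_se = 2 * log_preds.std(ddof=1) / np.sqrt(log_preds.size)
    return log_preds.mean(), two_se


def rmse(y_test, point_forecast):
    """Root mean squared error of a constant point forecast."""
    return np.sqrt(np.mean((y_test - point_forecast) ** 2))


# Hypothetical setup: test data drawn from a standard normal.
rng = np.random.default_rng(0)
y_test = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Forecast A: slightly biased mean, correct spread.
tll_a, se_a = tll_with_two_se(y_test, mean=0.3, sd=1.0)
rmse_a = rmse(y_test, 0.3)

# Forecast B: unbiased mean, far too wide.
tll_b, se_b = tll_with_two_se(y_test, mean=0.0, sd=3.0)
rmse_b = rmse(y_test, 0.0)

print(f"A: TLL = {tll_a:.3f} (+/- {se_a:.3f}), RMSE = {rmse_a:.3f}")
print(f"B: TLL = {tll_b:.3f} (+/- {se_b:.3f}), RMSE = {rmse_b:.3f}")
```

In this setup forecast A attains the higher TLL while forecast B attains the lower RMSE. RMSE only looks at the point forecast, whereas TLL also penalizes a predictive spread that is miscalibrated.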

Paper Structure (22 sections, 22 equations, 11 figures)

Figures (11)

  • Figure 1: (Left). Predictive distributions under the Bayesian posterior and mean field variational approximations. The two numbers in the title of each plot are the 2-Wasserstein distance to the exact posterior and test log-likelihood computed on $10^4$ test set observations. Two standard errors in the test log-likelihood estimate are (A) 0.03, (B) 0.03, (C) 0.02, (D) 0.02, (E) 0.02, (F) 0.02. (Right). The relationship between 2-Wasserstein distance to the posterior and test log-likelihood.
  • Figure 2: Contours of (A) the exact posterior, (B) the mean field variational approximation restricted to isotropic Gaussians, and (C)--(F) re-scaled mean field approximations. The line $\theta_1 = 0$ is highlighted in red.
  • Figure 3: (Left). Predictive distributions under the Bayesian posterior (A) and the SWAG posterior with SWAG learning rate of (B) $10^{-3}$, (C) $10^{-2}$, (D) $10^{-1}$, (E) $1$, and (F) $10$. The two numbers in the title of each plot are the 2-Wasserstein distance to the exact posterior and test log-likelihood computed on $10^4$ test set observations. Two standard errors in the test log-likelihood estimates are (A) 0.16, (B) 0.15, (C) 0.14, (D) 0.13, (E) 0.05, (F) 0.01. (Right). Contours of the (A) exact posterior, and (B)--(F) SWAG approximations with different learning rates. The line $\theta_1 = 0$ is highlighted in red.
  • Figure 4: (Left). Contours of (A) the exact posterior, (B) the mean field variational approximation restricted to isotropic Gaussians, and (C)--(F) re-scaled mean field approximations. The two numbers in the title of each plot are the 2-Wasserstein distance to the exact posterior and test log-likelihoods computed on $10^4$ test set observations. Two standard errors in the test log-likelihood estimates are (A) 0.019, (B) 0.020, (C) 0.014, (D) 0.013, (E) 0.011, (F) 0.009. (Right). The non-monotonic relationship between distance to posterior and test log-likelihood. Observe that the exact posterior does not achieve highest test log-likelihood.
  • Figure 5: Cartoon illustration highlighting the difference between three different discrepancies explored in \ref{sec:intuition}. The surfaces are spaces of distributions over a latent parameter (lower surface) or an observable data point $y^{\star}$ (upper surface). The pink line indicates that $\textrm{TLL}(\mathcal{D}^{\star};\hat{\Pi})$ estimates a discrepancy between the approximate posterior predictive $\hat{\Pi}(y^{\star} \vert \mathcal{D})$ (upper surface, lower right, red dot) and the true data-generating distribution $\mathcal{P}(y^{\star})$ (upper surface, upper right, black dot). The blue line represents a different discrepancy between the exact posterior predictive (upper surface, left, green dot) and the approximate posterior predictive (upper surface, lower right, red dot). The yellow line represents another different discrepancy between the exact posterior (lower surface, left, green dot) and the approximate posterior (lower surface, right, red dot). Gray lines connect distributions over parameters with their corresponding predictive distributions.
  • ...and 6 more figures
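
Several captions above report a 2-Wasserstein distance to the exact posterior. The source does not say how those distances were computed. As a minimal sketch, assuming both distributions are Gaussian, the distance has a closed form that can be evaluated as below; the helper names `psd_sqrt` and `w2_gaussian` are hypothetical.

```python
import numpy as np


def psd_sqrt(a):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    # Clip tiny negative eigenvalues caused by floating-point error.
    return (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T


def w2_gaussian(mean1, cov1, mean2, cov2):
    """2-Wasserstein distance between N(mean1, cov1) and N(mean2, cov2)."""
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)
    squared = np.sum((mean1 - mean2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
    # Guard against a slightly negative value from round-off.
    return np.sqrt(max(squared, 0.0))


# Hypothetical example: a unit-variance isotropic Gaussian compared with a
# re-scaled copy whose standard deviation is doubled in each coordinate.
mean = np.zeros(2)
base_cov = np.eye(2)
rescaled_cov = 4.0 * np.eye(2)

dist_same = w2_gaussian(mean, base_cov, mean, base_cov)
dist_rescaled = w2_gaussian(mean, base_cov, mean, rescaled_cov)
print(dist_same, dist_rescaled)
```

When the two covariances commute, as with the isotropic pair here, the formula reduces to the Euclidean distance between the means combined with the Frobenius distance between the covariance square roots.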