Are you using test log-likelihood correctly?
Sameer K. Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick
TL;DR
The paper questions the routine use of test log-likelihood (TLL) as a universal metric for comparing probabilistic models and approximate posteriors. It argues that TLL estimates the expected log predictive density (elpd) and so reflects how close the approximate posterior predictive is to the data-generating distribution, but that it need not indicate how well an approximate posterior matches the exact posterior, especially under misspecification or with finite data. Through a collection of misspecified and well-specified examples, the authors show that a higher TLL can coincide with a poorer posterior approximation and can even conflict with RMSE-based assessments of predictive performance. They advocate letting explicit goals guide evaluation, suggest complementary metrics and diagnostics beyond TLL for assessing posterior or predictive accuracy, and offer practical guidance on when to rely on TLL and how to interpret its results.
Abstract
Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
