Assessment of the quality of a prediction
Roger Sewell
TL;DR
The paper argues that the true mutual information $I(x;y)$ is ill-suited to evaluating a prediction algorithm’s output, and that the Apparent Shannon Information (ASI) $J(x;Q_y)$ is the appropriate, uniquely characterized measure. It develops a Bayesian framework using Dirichlet-based mixtures of skew-Student distributions to model the distribution of $j(x,y)=\log\frac{Q_y(x)}{P(x)}$ and to infer the posterior uncertainty in $J(x;Q_y)$, which is hard to estimate directly because that distribution is heavy-tailed and asymmetric. The method is illustrated on a Bayesian model predicting the recurrence time of prostate cancer, and is presented as applicable to most problems where the explicit distribution of $j(x,y)$ cannot be derived, although the modelling parameters may need adjusting case by case. The result is a principled way to attach an uncertainty to a measure of prediction quality, rather than reporting a point estimate alone.
Abstract
Shannon defined the mutual information between two variables. We illustrate why the true mutual information between a variable and the predictions made by a prediction algorithm is not a suitable measure of prediction quality, but the apparent Shannon mutual information (ASI) is; indeed it is the unique prediction quality measure with either of two very different lists of desirable properties, as previously shown by de Finetti and other authors. However, estimating the uncertainty of the ASI is a difficult problem, because the distribution of the individual values of $j(x,y)=\log\frac{Q_y(x)}{P(x)}$ has long, asymmetric heavy tails. We propose a Bayesian modelling method for the distribution of $j(x,y)$, from the posterior distribution of which the uncertainty in the ASI can be inferred. This method is based on Dirichlet-based mixtures of skew-Student distributions. We illustrate its use on data from a Bayesian model for prediction of the recurrence time of prostate cancer. We believe that this approach is generally appropriate for most problems, where it is infeasible to derive the explicit distribution of the samples of $j(x,y)$, though the precise modelling parameters may need adjustment to suit particular cases.
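To make the quantity $j(x,y)$ concrete, the following minimal sketch computes it on a toy problem and takes the sample mean as a point estimate of the ASI. Everything specific here is an assumption for illustration and is not taken from the paper: the Gaussian prior $P$, the noisy observation $y$, the helper names `log_normal_pdf` and `sample_j`, and the `var_scale` parameter used to make the predictor overconfident. It is also an assumption that the ASI is estimated as the mean of the individual $j(x,y)$ values. In this toy setup the overconfident prediction is an invertible function of $y$, so its true mutual information with $x$ is the same as for the honest predictor, yet its estimated ASI is far lower.

```python
import math
import random
import statistics


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def sample_j(n, noise_var=1.0, var_scale=1.0, seed=0):
    """Draw n values of j(x, y) = log Q_y(x) - log P(x) for a toy Gaussian model.

    Assumed toy setup (not from the paper):
      P:   x ~ N(0, 1)
      y  = x + noise, noise ~ N(0, noise_var)
      Q_y: N(posterior mean, var_scale * posterior variance)
    var_scale = 1 gives the exact Bayesian posterior; var_scale < 1 is overconfident.
    """
    rng = random.Random(seed)
    post_var = noise_var / (1.0 + noise_var)
    js = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, math.sqrt(noise_var))
        post_mean = y / (1.0 + noise_var)
        log_q = log_normal_pdf(x, post_mean, var_scale * post_var)
        log_p = log_normal_pdf(x, 0.0, 1.0)
        js.append(log_q - log_p)
    return js


honest = sample_j(20000)                        # exact posterior as the prediction
overconfident = sample_j(20000, var_scale=0.05)  # same mean, variance far too small

asi_honest = statistics.fmean(honest)
asi_over = statistics.fmean(overconfident)

# For the exact posterior, the expected value of j equals the true mutual
# information, which is 0.5 * ln(1 + 1/noise_var) = 0.5 * ln 2 nats here.
true_mi = 0.5 * math.log(2.0)

print(f"ASI estimate, honest predictor:        {asi_honest:.3f} nats")
print(f"ASI estimate, overconfident predictor: {asi_over:.3f} nats")
print(f"true mutual information I(x;y):        {true_mi:.3f} nats")
```

The sample mean alone says nothing about how uncertain the estimate is. Because a few badly missed predictions can contribute very large negative values of $j(x,y)$, a naive standard error can be misleading, which is the difficulty the proposed modelling of the distribution of $j(x,y)$ is meant to address.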
