What should an AI assessor optimise for?

Daniel Romero-Alvarado, Fernando Martínez-Plumed, José Hernández-Orallo

TL;DR

The paper asks whether an AI assessor should be trained directly on the target metric $L$, or on a proxy metric whose predictions are then mapped back to the target through a transformation. It reports an empirical study across twenty regression and classification problems, training assessors on both direct target losses and proxy losses with transformations, and evaluating them with Spearman correlations. The key finding is that carefully chosen proxy losses (notably logistic and logarithmic forms) can outperform direct optimisation for the target metric, suggesting that monotonic transformations enable effective cross-metric assessment. This has practical implications for designing robust, transferable assessors in complex AI systems and motivates further exploration in multiclass and structured-task settings.

Abstract

An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Assessors can leverage information from the test results of many other AI systems and have the flexibility of being trained on any loss function or scoring rule: from squared error to toxicity metrics. Here we address the question: is it always optimal to train the assessor for the target metric? Or could it be better to train for a different metric and then map predictions back to the target metric? Using twenty regression and classification problems with tabular data, we experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings and find that, contrary to intuition, optimising for more informative metrics is not generally better. Surprisingly, some monotonic transformations are promising. For example, the logistic loss is useful for minimising absolute or quadratic errors in regression, and the logarithmic score helps maximise quadratic or spherical scores in classification.
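The proxy-then-map idea in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the paper's definition: the logistic proxy is taken to be a bounded transform of the absolute error with a hypothetical steepness `B`, the helper names are invented, and no assessor is actually trained (its predictions are simulated by shrinking the true proxy loss by a random factor). The sketch shows two things: a monotonic map `f` recovers the squared loss from the proxy, and a rank-based score such as Spearman's correlation is unaffected by that map.

```python
import math
import random

B = 1.0  # hypothetical steepness of the logistic proxy (assumption)


def squared_loss(y, y_hat):
    """Target metric in this sketch: squared error."""
    return (y_hat - y) ** 2


def logistic_loss(y, y_hat, b=B):
    """Assumed proxy: maps |error| in [0, inf) monotonically into [0, 1)."""
    return 2.0 / (1.0 + math.exp(-b * abs(y_hat - y))) - 1.0


def logistic_to_squared(l, b=B):
    """Monotonic map f from a (predicted) proxy value back to squared loss."""
    l = min(max(l, 0.0), 1.0 - 1e-12)  # clip to the proxy's valid range
    abs_err = math.log((1.0 + l) / (1.0 - l)) / b  # inverse of logistic_loss
    return abs_err ** 2


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)


def demo(n=200, seed=0):
    rng = random.Random(seed)
    y = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y_hat = [v + rng.gauss(0.0, 1.0) for v in y]  # base-model predictions
    true_proxy = [logistic_loss(a, b) for a, b in zip(y, y_hat)]
    true_target = [squared_loss(a, b) for a, b in zip(y, y_hat)]
    # Simulated (not trained) assessor output on the proxy scale.
    pred_proxy = [l * rng.uniform(0.7, 1.0) for l in true_proxy]
    pred_target = [logistic_to_squared(l) for l in pred_proxy]
    return spearman(pred_proxy, true_proxy), spearman(pred_target, true_target)


if __name__ == "__main__":
    rho_proxy, rho_target = demo()
    print(f"Spearman on proxy scale:  {rho_proxy:.4f}")
    print(f"Spearman on target scale: {rho_target:.4f}")
```

Because `f` is monotonic, the two correlations printed by `demo()` should agree up to floating-point ties. Any advantage of a proxy assessor must therefore come from how the learner fits the differently shaped proxy distribution, not from the mapping step itself.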

Paper Structure

This paper contains 24 sections, 50 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: For an energy consumption model $M_1$, we want to anticipate the squared error ($L^{+}_2$) for each new example using an external predictor, called assessor. Recommendations to customers are only made when the assessor predicts low $L^{+}_2$ in the energy consumption estimate. We will explore assessors that optimise for the target loss function (squared loss $L^{+}_2$, top) but also assessors that use a proxy loss function (logistic loss $L^{+}_L$, bottom) followed by a transformation ($f$). Can the proxy assessor be better?
  • Figure 2: Software Effort dataset (Table \ref{tab:ds}), 255 regression models. Top: scatter plot of $\hat{y}$ versus $y$. Bottom: histogram of losses: simple ($L^{+}_1$), squared ($L^{+}_2$), and logistic ($L^{+}_L$). The top and bottom rows show the signed and unsigned versions. Assessors predict these losses, noting differences in shape and tails.
  • Figure 3: Histograms of $r_\circledcirc$, the probability estimation for the correct class for some representative datasets, all averaging 255 classification models: CDC Diabetes (most have this shape), JM1 (slightly bimodal) and Higgs (quite bad with values around 0.5). All datasets in Appendix \ref{app:histograms}.
  • Figure 4: (Left) Score matrix for XGBoost assessor model. (Right) Aggregated Spearman margin matrix for XGBoost assessor model. In both matrices, rows represent target errors and columns proxy errors. Red values indicate poor performance from trying to predict $L_{\shortrightarrow\!\circ}$ by learning $L_{\circ\!\shortrightarrow}$. Green values show instances where learning from $L_{\circ\!\shortrightarrow}$ is better than from learning directly from $L_{\shortrightarrow\!\circ}$.
  • Figure 5: Scatter plots for the assessor of the Parkinson’s Disease Rating Scale for RandomForest Regressor base models and assessor model XGBoost. Because the predictions of the assessor tend to the mean, the case where the proxy is signed takes predictions towards 0, and the predictions usually fall under the diagonal.
  • ...and 12 more figures

Theorems & Definitions (6)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5
  • Definition 3.6