Assessing the Probabilistic Fit of Neural Regressors via Conditional Congruence

Spencer Young; Riley Sinema; Cole Edgren; Andrew Hall; Nathan Dong; Porter Jenkins


TL;DR

The paper addresses the inadequacy of calibration-based metrics for judging probabilistic predictive fit in neural regressors. It introduces conditional congruence and the Conditional Congruence Error (CCE), built on maximum conditional mean discrepancy and conditional kernel mean embeddings, to provide a point-wise, input-specific measure of how closely a model's predictive distribution matches the true conditional distribution. Through theoretical guarantees and extensive experiments on image regression datasets, CCE is shown to be correct, monotonic, reliable, and robust, with the added benefit of diagnosing point-wise failures without labels. The work demonstrates that CCE outperforms traditional metrics like ECE and NLL in characterizing probabilistic alignment, supporting more reliable deployment and enabling applications such as selective rejection based on congruence. It also offers practical guidance on hyperparameters and computation, paving the way for broader adoption in uncertainty quantification for regression tasks.

Abstract

While significant progress has been made in specifying neural networks capable of representing uncertainty, deep networks still often suffer from overconfidence and misaligned predictive distributions. Existing approaches for measuring this misalignment are primarily developed under the framework of calibration, with common metrics such as Expected Calibration Error (ECE). However, calibration can only provide a strictly marginal assessment of probabilistic alignment. Consequently, calibration metrics such as ECE are $\textit{distribution-wise}$ measures and cannot diagnose the $\textit{point-wise}$ reliability of individual inputs, which is important for real-world decision-making. We propose a stronger condition, which we term $\textit{conditional congruence}$, for assessing probabilistic fit. We also introduce a metric, Conditional Congruence Error (CCE), that uses conditional kernel mean embeddings to estimate the distance, at any point, between the learned predictive distribution and the empirical conditional distribution in a dataset. We perform several high-dimensional regression tasks and show that CCE exhibits four critical properties: $\textit{correctness}$, $\textit{monotonicity}$, $\textit{reliability}$, and $\textit{robustness}$.
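To make the idea of a point-wise distance between two conditional distributions concrete, here is a minimal NumPy sketch of a squared conditional mean discrepancy estimate. It assumes Gaussian (RBF) kernels and a ridge-regularized conditional kernel mean embedding; the function names (`rbf_gram`, `mcmd_squared`), the regularizer `lam`, and the bandwidths `gamma_x` and `gamma_y` are illustrative assumptions, not the paper's implementation, and the paper's exact estimator may differ. In a CCE-style use, `(x, y)` would be held-out data and `(x_prime, y_prime)` would pair the same inputs with samples drawn from the model's predictive distribution.

```python
import numpy as np

def rbf_gram(a, b, gamma):
    """Gaussian (RBF) Gram matrix between the rows of a and the rows of b."""
    sq_dists = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mcmd_squared(x_eval, x, y, x_prime, y_prime, lam=0.1, gamma_x=1.0, gamma_y=1.0):
    """Squared discrepancy between the embeddings of P(Y|X) and P(Y'|X') at x_eval.

    All inputs are 2-D arrays (rows are samples). Hyperparameter defaults are
    assumptions chosen for illustration only.
    """
    n, m = len(x), len(x_prime)
    # Ridge-regression weights of each evaluation point on the samples.
    a = np.linalg.solve(rbf_gram(x, x, gamma_x) + n * lam * np.eye(n),
                        rbf_gram(x, x_eval, gamma_x))
    b = np.linalg.solve(rbf_gram(x_prime, x_prime, gamma_x) + m * lam * np.eye(m),
                        rbf_gram(x_prime, x_eval, gamma_x))
    # Expand || sum_i a_i phi(y_i) - sum_j b_j phi(y'_j) ||^2 with Gram matrices.
    t1 = np.einsum("iq,ij,jq->q", a, rbf_gram(y, y, gamma_y), a)
    t2 = np.einsum("iq,ij,jq->q", a, rbf_gram(y, y_prime, gamma_y), b)
    t3 = np.einsum("iq,ij,jq->q", b, rbf_gram(y_prime, y_prime, gamma_y), b)
    return np.maximum(t1 - 2.0 * t2 + t3, 0.0)  # clip tiny negative round-off

# Synthetic demonstration data (not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = np.sin(x.sum(1, keepdims=True)) + 0.1 * rng.normal(size=(50, 1))
same = mcmd_squared(x[:5], x, y, x, y)            # identical conditionals
shifted = mcmd_squared(x[:5], x, y, x, y + 5.0)   # targets shifted by a constant
```

In this toy setup the estimate vanishes when both samples coincide and is strictly positive when the second set of targets is shifted, which is the qualitative behavior a congruence measure should show.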

Paper Structure (45 sections, 1 theorem, 6 equations, 25 figures, 3 tables, 1 algorithm)


Preliminaries
Notation
Assessing probabilistic fit
Measuring conditional congruence
Maximum Conditional Mean Discrepancy (MCMD)
Computing MCMD
Theoretical guarantees of MCMD
Conditional Congruence Error (CCE)
Experiments
Datasets
Baselines
Estimators of Probabilistic Fit
Deep Regression Models
Evaluating the correctness of CCE
Evaluating the monotonicity of CCE
...and 30 more sections

Key Result

Theorem 1

Suppose $k_\mathcal{X}$ and $k_\mathcal{Y}$ are characteristic kernels. Also suppose that the probability measures associated with $X$ and $X'$ in the conditioning space, $P_X$ and $P_{X'}$, are absolutely continuous with respect to each other. Finally, suppose that $P_{Y|X}$ and $P_{Y'|X'}$ admit r

Figures (25)

Figure 1: We test the correctness property of CCE by visualizing the probabilistic fit of four DNNs trained on a synthetic dataset with discrete regression targets. (Row 1) A plot of the predicted mean and 95% credible interval for each model, along with its ECE. (Row 2) NLL incurred on the test points, along with the overall mean value. (Row 3) CCE between the test points and predictive distributions. We plot the mean and one standard deviation range at each point, estimated with a bootstrap approximation of the sampling distribution of CCE. ECE and NLL standard errors are also computed with the bootstrap. CCE is able to accurately characterize the quality of each model's predictive distribution, relative to the test data.
Figure 2: We test the monotonicity property of CCE. We measure its ability to describe the misalignment between the model's predictive distribution and the test data under progressively larger perturbations. For both the COCO-People and AAF regression datasets, we train three DNNs. We analyze the behavior of CCE under Gaussian blur, label noise, and mixup corruptions. In general, CCE increases monotonically as the corruptions become more severe. We run 5 trials for each perturbation and report the mean and standard deviation (error bars indicate $\pm 1$ standard deviation).
Figure 3: We test the robustness property of CCE, compared to NLL. From left to right: A t-SNE projection of the test split of AAF (color indicates age), the CCE values achieved by a trained Gaussian DNN, and the NLL values from this model on each test input. NLL is sensitive to outliers, highlighting a handful of points (see (-40, -18) and (-10, 10)) as being exceptionally poorly fit while assigning roughly equal values to all other points. Meanwhile, CCE illuminates a broader set of points with faulty predictions, shedding light on the local structure of congruence. In this case, many of the regions of high incongruence appear to line up with images of older faces, suggesting the model cannot reliably assess the ages of elderly people.
Figure 4: Test images that incurred the lowest (first row) and highest (second row) CCE values for a $\beta$-Gaussian model (Seitzer et al., 2022) trained on the AAF dataset. CCE values are indicated above each image. Higher CCE implies a larger discrepancy between the learned and actual conditional distributions.
Figure 5: Studying the impact of the kernel functions $k_\mathcal{X}$ and $k_\mathcal{Y}$ on the MCMD. We vary both kernel hyperparameters (a) and the class of kernel function used (b).
...and 20 more figures
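
The caption of Figure 1 notes that standard errors for ECE and NLL are computed with the bootstrap. As a rough illustration of that idea, the sketch below resamples per-example Gaussian NLL values with replacement and reports the spread of the resampled means. The helper names (`gaussian_nll`, `bootstrap_mean_se`), the number of resamples, and the synthetic data are assumptions for illustration; the paper's exact procedure is not specified here.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Per-example negative log-likelihood under N(mu, sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * ((y - mu) / sigma) ** 2

def bootstrap_mean_se(values, n_boot=1000, seed=0):
    """Bootstrap estimate of the mean of `values` and its standard error."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row indexes one resample of the data, drawn with replacement.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    return boot_means.mean(), boot_means.std(ddof=1)

# Synthetic example: a model that predicts N(0, 1) for targets drawn from N(0, 1).
rng = np.random.default_rng(1)
targets = rng.normal(size=200)
nll_values = gaussian_nll(targets, mu=0.0, sigma=1.0)
nll_mean, nll_se = bootstrap_mean_se(nll_values)
```

The same resampling loop applies to any per-example or per-dataset statistic; only the function evaluated on each resample changes.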

Theorems & Definitions (2)

Definition 1
Theorem 1
