PAGER: A Framework for Failure Analysis of Deep Regression Models

Jayaraman J. Thiagarajan, Vivek Narayanaswamy, Puja Trivedi, Rushil Anirudh

TL;DR

PAGER tackles the problem of failure detection in deep regression by challenging the sufficiency of epistemic uncertainty alone for risk characterization. The framework blends forward anchoring (uncertainty) with reverse anchoring (manifold non-conformity) to yield Score_1 and Score_2, organizing test samples into risk regimes (ID, Low Risk, Moderate Risk, High Risk). Across 1D benchmarks, high-dimensional regression, and image regression, PAGER consistently outperforms baselines on false negatives, false positives, and regime-confusion metrics, even under distribution shifts. The approach enhances safety for regression deployments by providing a practical, calibration-free method to detect and categorize failures, with potential impact across healthcare, physical sciences, and robotics.

Abstract

Safe deployment of AI models requires proactive detection of failures to prevent costly errors. To this end, we study the important problem of detecting failures in deep regression models. Existing approaches rely on epistemic uncertainty estimates or inconsistency w.r.t. the training data to identify failure. Interestingly, we find that while uncertainties are necessary, they are insufficient to accurately characterize failure in practice. Hence, we introduce PAGER (Principled Analysis of Generalization Errors in Regressors), a framework to systematically detect and characterize failures in deep regressors. Built upon the principle of anchored training in deep models, PAGER unifies epistemic uncertainty and complementary manifold non-conformity scores to accurately organize samples into different risk regimes.

Paper Structure (17 sections, 4 equations, 8 figures, 6 tables, 3 algorithms)

Figures (8)

  • Figure 1: Epistemic uncertainty, while necessary, is not sufficient to completely characterize all risk regimes. Top: Out-of-support (OOS) samples in the range $[2.2, 2.7]$ exhibit low uncertainty but moderate risk due to significant deviation from the true function. Bottom: Even with better experiment designs, uncertainty alone in the extrapolating regime $[4.5, 5]$ is unreliable due to potential drift from the truth. We propose PAGER, a framework that leverages anchoring (Thiagarajan et al., 2022) to unify prediction uncertainty and non-conformity to the training data manifold. PAGER accurately flags those erroneous regimes as Moderate Risk (shown in blue) and outperforms existing baselines in accurately categorizing samples consistent with the true risk (lower MAE).
  • Figure 2: An illustration of different risk regimes. Using examples in 1D and 2D, we show ID, OOS and OOD regimes.
  • Figure 3: Overview of our proposed framework. PAGER organizes test examples into bins (low, moderate and high) using both predictive uncertainty and manifold non-conformity (MNC) scores. With such a categorization, PAGER assigns samples to four levels of expected risk (ID, Low Risk, Moderate Risk and High Risk). We also advocate a suite of metrics that enables a holistic assessment of failure detectors.
  • Figure 4: PAGER can detect failures under complex distribution shifts effectively. We assess PAGER on the Skillcraft dataset characterized by real-world shifts (change in league index), and find that it achieves reductions in all metrics over the baselines.
  • Figure 5: Efficacy of PAGER on Image Regression Benchmarks. Compared to the state-of-the-art baseline DEUP, PAGER effectively reduces the FN, FP and confusion metrics even under challenging extrapolation scenarios. We find that PAGER can consistently flag samples from the unobserved regimes, which correspond to highly erroneous predictions.
  • ...and 3 more figures
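
The binning step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 25th/75th percentile thresholds, the `REGIMES` lookup table, and all function names are assumptions made for this example, since the text above does not say how bin pairs map to the four risk levels. The two input lists stand in for per-sample uncertainty and MNC scores that are assumed to be already computed.

```python
# Hypothetical lookup from (uncertainty bin, MNC bin) to a risk regime.
# The source does not specify this mapping; it is an assumption for illustration.
REGIMES = {
    ("low", "low"): "ID",
    ("low", "moderate"): "Low Risk",
    ("moderate", "low"): "Low Risk",
    ("moderate", "moderate"): "Low Risk",
    ("low", "high"): "Moderate Risk",
    ("moderate", "high"): "Moderate Risk",
    ("high", "low"): "Moderate Risk",
    ("high", "moderate"): "High Risk",
    ("high", "high"): "High Risk",
}


def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def to_bins(scores, low_q=25.0, high_q=75.0):
    """Label each score as low, moderate or high using percentile thresholds.

    The default thresholds are assumed values, not taken from the text above.
    """
    low_t = percentile(scores, low_q)
    high_t = percentile(scores, high_q)
    bins = []
    for s in scores:
        if s <= low_t:
            bins.append("low")
        elif s > high_t:
            bins.append("high")
        else:
            bins.append("moderate")
    return bins


def assign_regimes(uncertainty, non_conformity):
    """Combine the two per-sample score lists into one regime label per sample."""
    if len(uncertainty) != len(non_conformity):
        raise ValueError("score lists must have the same length")
    pairs = zip(to_bins(uncertainty), to_bins(non_conformity))
    return [REGIMES[pair] for pair in pairs]


if __name__ == "__main__":
    unc = [0.1, 0.2, 0.3, 0.4, 0.9]   # made-up uncertainty scores
    mnc = [0.1, 0.9, 0.3, 0.4, 0.2]   # made-up non-conformity scores
    for u, m, regime in zip(unc, mnc, assign_regimes(unc, mnc)):
        print(u, m, regime)
```

The point of the two-score lookup is that a sample with low uncertainty but high non-conformity is still flagged, which matches the caption of Figure 1: uncertainty alone would have treated such a sample as safe.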