Rejection via Learning Density Ratios

Alexander Soen, Hisham Husain, Philip Schulz, Vu Nguyen

TL;DR

This work reframes classification with rejection as learning density ratios between an idealized, regularized distribution $\mathrm{Q}$ and the data distribution $\mathrm{P}$, using $\varphi$-divergence regularization to define the ratio $\rho = d\mathrm{Q}/d\mathrm{P}$. By deriving closed-form density-ratio rejectors under KL and $\alpha$-divergences, and by tying these to Generalized Variational Inference and Distributionally Robust Optimization, the authors recover classical rejection policies (Chow's rule) in the Bayes-optimal limit and enable practical, post-hoc rejection using calibrated posteriors. The approach is validated across six datasets with varying noise, showing competitive or superior accuracy-coverage trade-offs and providing insights into calibration and robustness in selective prediction. Overall, the framework offers a principled, distributional path from risk minimization with rejection to density-ratio based decision rules that can flexibly integrate with pretrained classifiers and robustness considerations in high-stakes settings.

Abstract

Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. The predominant approach is to alter the supervised learning pipeline by augmenting typical loss functions, letting model rejection incur a lower loss than an incorrect prediction. Instead, we propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance. This can be formalized via the optimization of a loss's risk with a $\varphi$-divergence regularization term. Through this idealized distribution, a rejection decision can be made by utilizing the density ratio between this distribution and the data distribution. We focus on the setting where our $\varphi$-divergences are specified by the family of $\alpha$-divergences. Our framework is tested empirically over clean and noisy datasets.
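
To make the density-ratio idea concrete, here is a minimal sketch of the KL-regularized case. It assumes the objective has the standard form $\min_{\mathrm{Q}} \mathbb{E}_{\mathrm{Q}}[\ell] + \lambda\,\mathrm{KL}(\mathrm{Q} \,\|\, \mathrm{P})$, whose minimizer is the Gibbs form $\rho(x) \propto \exp(-\ell(x)/\lambda)$, and that an input is rejected when $\rho(x) \le \tau$. The function names, the parameter `lam`, and the per-input loss estimates are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def kl_density_ratio(losses, lam):
    """Empirical density ratio rho = dQ/dP for a KL-regularized objective.

    Assumption: `losses` holds one loss estimate per input (hypothetical
    input; in practice it could come from a calibrated posterior).
    """
    losses = np.asarray(losses, dtype=float)
    # Shift by the minimum loss for numerical stability; the shift
    # cancels in the normalization below.
    weights = np.exp(-(losses - losses.min()) / lam)
    # Normalize so that rho averages to 1 over the sample (E_P[rho] = 1).
    return weights / weights.mean()


def reject(rho, tau):
    """Assumed rejection rule: abstain where the density ratio is at most tau."""
    return np.asarray(rho) <= tau


if __name__ == "__main__":
    rho = kl_density_ratio([0.1, 0.1, 2.0], lam=1.0)
    print(rho)               # high-loss input receives the smallest ratio
    print(reject(rho, 0.5))  # boolean mask of rejected inputs
```

A smaller `lam` lets $\mathrm{Q}$ move further from $\mathrm{P}$, so the ratio separates low-loss and high-loss inputs more sharply.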

Paper Structure (37 sections, 23 theorems, 73 equations, 15 figures, 3 tables, 1 algorithm)

Key Result

Theorem 2.1

Let us consider the binary CPE setting, where $\mathcal{Y} = \{ 0, 1 \}$ and $\mathcal{Y}^{\prime} = [0, 1]$, and let $\ell$ be any proper loss function (reid2010composite), e.g., log loss. (Properness ensures that the true class probability is the minimizer of the loss in expectation.) Then w.r.t. eq:rej_risk_...

Figures (15)

  • Figure 1: An idealized distribution $\color{blue} \mathrm{Q}$ is learned to minimize the loss of a model. We then compare $\color{blue} \mathrm{Q}$ with the original data distribution $\color{red} \mathrm{P}$ via a density ratio $\rho = {\color{blue} \mathrm{d}\mathrm{Q}} / {\color{red} \mathrm{d}\mathrm{P}}$. A rejection criterion is defined via a threshold value $\color{orange} \tau$.
  • Figure 2: Accuracy vs coverage plots across select datasets and all approaches, with 50 equidistant $\tau \in (0, 1]$ and $c \in [0, 0.5)$ values (sorted by coverage). The black horizontal line depicts base models trained without rejection. Approaches missing from a plot either reject more than 60% of test points or have accuracy below the base model. Shaded regions indicate $\pm 1$ standard deviation.
  • Figure : An idealized distribution $\color{blue} \mathrm{Q}$ is learned to minimize the loss of a model. We then compare $\color{blue} \mathrm{Q}$ with the original data distribution $\color{red} \mathrm{P}$ via a density ratio $\rho = {\color{blue} \mathrm{d}\mathrm{Q}} / {\color{red} \mathrm{d}\mathrm{P}}$. A rejection criterion is defined via a threshold value $\color{orange} \tau$.
  • Figure II: Extended plots for HAR of Figure 2.
  • Figure III: Extended plots for Gas Drift of Figure 2.
  • ...and 10 more figures
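
The accuracy vs coverage curves described in the captions above can be approximated with a short sketch. It assumes a point is accepted when its density ratio exceeds $\tau$, and it sweeps 50 equidistant thresholds in $(0, 1]$ as in the Figure 2 caption. The function name and the toy inputs are hypothetical, and this is not the authors' evaluation code.

```python
import numpy as np


def accuracy_coverage(correct, rho, taus):
    """Return (coverage, accuracy) pairs sorted by coverage.

    correct : 1/0 (or bool) per test point, whether the base model was right.
    rho     : density-ratio score per test point.
    taus    : thresholds to sweep; a point is accepted when rho > tau
              (assumed convention).
    """
    correct = np.asarray(correct, dtype=bool)
    rho = np.asarray(rho, dtype=float)
    points = []
    for tau in taus:
        accepted = rho > tau
        coverage = accepted.mean()
        if coverage == 0:
            # Accuracy is undefined when every point is rejected.
            continue
        accuracy = correct[accepted].mean()
        points.append((float(coverage), float(accuracy)))
    return sorted(points)


if __name__ == "__main__":
    taus = np.linspace(0.02, 1.0, 50)  # 50 equidistant values in (0, 1]
    curve = accuracy_coverage([1, 1, 0, 1], [1.5, 1.2, 0.3, 0.9], taus)
    print(curve[0], curve[-1])
```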

Theorems & Definitions (43)

  • Theorem 2.1: Optimal CPE Rejection / Chow's Rule
  • Definition 2.2
  • Definition 3.1
  • Definition 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Corollary 4.1
  • Theorem 4.2: Informal
  • Theorem 4.3
  • Definition 4.4
  • ...and 33 more