Criterion Collapse and Loss Distribution Control

Matthew J. Holland

TL;DR

This work investigates criterion collapse, the phenomenon where optimizing one learning criterion implies optimality in another, extending the analysis beyond the standard expected loss to a wide range of risk criteria. It develops a unified theoretical treatment of Bernoulli (zero-one) losses and surrogate losses, showing that many monotonic criteria (e.g., DRO, CVaR, tilted ERM) collapse to error-probability minimizers, while non-monotonic criteria can avoid this. The author introduces loss-restraining criteria, relates them to the non-monotonic criteria underlying ascent-descent algorithms from the literature (e.g., Flooding, SoftAD), and shows in image-classification experiments that such approaches can balance surrogate loss, accuracy, and model norm. The results offer guidance for designing learning objectives that align with diverse evaluation metrics, and caution against over-optimizing monotonic risk criteria in highly expressive models.

Abstract

In this work, we consider the notion of "criterion collapse," in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers under a wide variety of learning criteria, ranging from DRO and OCE risks (CVaR, tilted ERM) to non-monotonic criteria underlying recent ascent-descent algorithms explored in the literature (Flooding, SoftAD). We show how collapse in the context of losses with a Bernoulli distribution goes far beyond existing results for CVaR and DRO, then expand our scope to include surrogate losses, showing conditions where monotonic criteria such as tilted ERM cannot avoid collapse, whereas non-monotonic alternatives can.
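To make the idea of collapse under Bernoulli losses concrete, the following is a minimal sketch and is not taken from the paper. It assumes a zero-one loss with error probability p, uses the standard closed forms for CVaR and the tilted (entropic) risk of a Bernoulli variable, and applies the Flooding transformation |m - theta| + theta to the mean loss m. The function names, the grid of error probabilities, and the parameter values (beta, gamma, theta) are all illustrative assumptions.

```python
import math

def cvar_bernoulli(p, beta):
    """CVaR at level beta of a Bernoulli(p) loss: mean of the worst (1 - beta) tail."""
    return min(1.0, p / (1.0 - beta))

def tilted_bernoulli(p, gamma):
    """Tilted (entropic) risk (1/gamma) * log E[exp(gamma * L)] for L ~ Bernoulli(p)."""
    return math.log(1.0 - p + p * math.exp(gamma)) / gamma

def flooding(mean_loss, theta):
    """Flooding-style criterion: penalizes deviation of the mean loss from theta."""
    return abs(mean_loss - theta) + theta

def is_nondecreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))

# Hypothetical grid of achievable error probabilities (illustrative only).
error_probs = [i / 20 for i in range(21)]

cvar_vals = [cvar_bernoulli(p, beta=0.9) for p in error_probs]
tilted_vals = [tilted_bernoulli(p, gamma=2.0) for p in error_probs]
flood_vals = [flooding(p, theta=0.25) for p in error_probs]

# Monotonic criteria: the smallest error probability also minimizes the criterion.
print("CVaR nondecreasing in p:  ", is_nondecreasing(cvar_vals))
print("Tilted nondecreasing in p:", is_nondecreasing(tilted_vals))

# Non-monotonic criterion: the minimizer sits at theta, not at the smallest error.
best_flood_p = min(zip(flood_vals, error_probs))[1]
print("Flooding minimizer p:     ", best_flood_p)
```

Because both CVaR and the tilted risk are nondecreasing functions of p here, any hypothesis minimizing error probability also minimizes them, which is the sense of collapse discussed above. The Flooding-style criterion is not monotonic in p, so its minimizer can differ.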

Paper Structure

This paper contains 34 sections, 6 theorems, 83 equations, 5 figures, 1 table.

Key Result

Theorem 1

For arbitrary random loss $\mathsf{L} \in \mathcal{L}$, denote the distributionally robust optimization (DRO) criterion by $\sup_{P \in \mathcal{P}} \mathbf{E}_{P}[\mathsf{L}]$, where the "uncertainty set" $\mathcal{P}$ is taken to be a ball centered at some pre-defined data distribution on $\mathcal{X} \times \mathcal{Y}$, with finite radius measured by a valid $f$-divergence. Under zero-one loss ...

Figures (5)

  • Figure 1: In the left plot, we show the three possible data points that can arise in the example described in §\ref{sec:surrogates_nolink}. Points above the dashed silver line are assigned a label of $1$ by $h_{1}$ and $-1$ by $h_{2}$; signs are reversed for all points below this line. For the outlying point in the bottom right, we have set $a=2$ in this example. In the right plot, we illustrate setting $p > 1/2$ to ensure that the optimality of $h_{1}$ and $h_{2}$ diverges under distinct criteria.
  • Figure 2: Key metrics of interest (vertical axis) over epochs (horizontal axis). Here "loss" refers to average surrogate loss, "acc" refers to accuracy, and "norm" refers to the model L2 norm. Loss and accuracy are given for both training (dotted lines) and test data (solid lines). Plots on the left are for CIFAR-100, and plots on the right are for SVHN.
  • Figure 3: Graphs of $g_{\tau}(\cdot)$ in (\ref{eqn:nonmonotonic_variantile_helper}) over the unit interval for varying choices of $\tau$.
  • Figure 4: Examples of valid choices of $\rho$ (left) and $\widetilde{\rho}$ (right) for use in defining OCE criteria (\ref{eqn:defn_OCE}) and loss-restraining criteria (\ref{eqn:defn_Cinner})--(\ref{eqn:defn_Couter}), respectively.
  • Figure 5: Results for CIFAR-10 and FashionMNIST; see the caption of Figure 2 for details.

Theorems & Definitions (20)

  • Theorem 1: DRO criterion (hu2018a, Thm. 1)
  • Theorem 2: CVaR criterion (zhai2021b, Prop. 1)
  • Proposition 3: Collapse of left quantiles
  • Remark 4: Related case: right quantiles
  • Proposition 5: Collapse under monotonic dispersion
  • Remark 6: Special case: OCE criteria
  • Remark 7: Related case: Cressie-Read DRO
  • Remark 8: Related case: criteria based on Orlicz regret
  • Remark 9: Non-monotonic alternative: variantile
  • Proposition 10
  • ...and 10 more