Trading off Consistency and Dimensionality of Convex Surrogates for the Mode

Enrique Nueve; Bo Waggoner; Dhamma Kimpara; Jessie Finocchiaro

TL;DR

This work revisits the classic multiclass surrogate-design problem by relaxing the requirement of full consistency, which needs surrogate dimension $d\ge n-1$, and studying partial consistency through polytope embeddings. By embedding outcomes into low-dimensional polytopes and using a square-loss surrogate with a MAP link, the authors show that hallucination regions inevitably arise when $d<n-1$, but that calibration regions still exist around each point-mass distribution. Under structured low-noise assumptions, calibration can be checked and preserved for concrete embeddings, such as $n=2^{d}$ outcomes in the $d$-dimensional unit cube and $n=d!$ outcomes in the $d$-dimensional permutahedron. They further show that using multiple problem instances enables mode elicitation over the whole simplex with $\frac{n}{2}$ dimensions, offering a scalable alternative to full $(n-1)$-dimensional surrogates. These results give practical guidance for selecting embeddings and problem-instance strategies in large-scale multiclass or structured prediction tasks, with implications for calibration guarantees and parallelizable optimization.

Abstract

In multiclass classification over $n$ outcomes, the outcomes must be embedded into the reals with dimension at least $n-1$ in order to design a consistent surrogate loss that leads to the "correct" classification, regardless of the data distribution. For large $n$, such as in information retrieval and structured prediction tasks, optimizing a surrogate in $n-1$ dimensions is often intractable. We investigate ways to trade off surrogate loss dimension, the number of problem instances, and restricting the region of consistency in the simplex for multiclass classification. Following past work, we examine an intuitive embedding procedure that maps outcomes into the vertices of convex polytopes in a low-dimensional surrogate space. We show that full-dimensional subsets of the simplex exist around each point mass distribution for which consistency holds, but also that, with fewer than $n-1$ dimensions, there exist distributions for which a phenomenon called hallucination occurs, in which the optimal report under the surrogate loss is an outcome with zero probability. Looking towards application, we derive a result to check if consistency holds under a given polytope embedding and low-noise assumption, providing insight into when to use a particular embedding. We provide examples of embedding $n = 2^{d}$ outcomes into the $d$-dimensional unit cube and $n = d!$ outcomes into the $d$-dimensional permutahedron under low-noise assumptions. Finally, we demonstrate that with multiple problem instances, we can learn the mode with $\frac{n}{2}$ dimensions over the whole simplex.
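As a concrete illustration of hallucination, the sketch below embeds $n = 2^{3} = 8$ outcomes at the vertices of a cube in $\mathbb{R}^{3}$ and computes the square-loss minimizer, which is the mean of the embedded outcomes. Several choices here are assumptions rather than details taken from the paper: the cube is centered with vertices in $\{-1,1\}^{3}$, the MAP link is taken to mean "nearest embedded vertex", and the helper names (`cube_vertices`, `square_loss_minimizer`, `map_link`) are hypothetical.

```python
from itertools import product

def cube_vertices(d):
    """All 2**d vertices of the centered cube {-1, 1}**d, one per outcome."""
    return list(product((-1.0, 1.0), repeat=d))

def square_loss_minimizer(p, vertices):
    """The minimizer of E_p ||u - phi(Y)||^2 over u is the mean of phi(Y)."""
    d = len(vertices[0])
    return tuple(sum(p_y * v[i] for p_y, v in zip(p, vertices)) for i in range(d))

def map_link(u, vertices):
    """Assumed link: index of the vertex nearest to u (lowest index on ties)."""
    def sq_dist(v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return min(range(len(vertices)), key=lambda i: sq_dist(vertices[i]))

verts = cube_vertices(3)

# Uniform distribution over three vertices that all neighbor (1, 1, 1).
support = [(1.0, 1.0, -1.0), (1.0, -1.0, 1.0), (-1.0, 1.0, 1.0)]
p = [1.0 / 3.0 if v in support else 0.0 for v in verts]

u = square_loss_minimizer(p, verts)   # approximately (1/3, 1/3, 1/3)
y_hat = map_link(u, verts)            # index of the linked outcome

# The linked outcome is (1, 1, 1), which has probability zero under p.
print(verts[y_hat], p[y_hat])
```

The three supported vertices pull the mean toward their common neighbor, so the linked outcome is one that never occurs under `p`. With only three dimensions for eight outcomes, the report cannot separate this distribution from one concentrated near $(1,1,1)$.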

Paper Structure (22 sections, 20 theorems, 20 equations, 3 figures, 3 tables, 2 algorithms)

This paper contains 22 sections, 20 theorems, 20 equations, 3 figures, 3 tables, 2 algorithms.

Introduction
Background and Notation
Property Elicitation, Consistency, and Prediction Dimension
Polytope Embedding and Existence of Calibrated Regions
Polytope Embedding Construction
Hallucination Regions
Calibration Regions
Restoring Inconsistent Surrogates via Low-Noise Assumptions
Calibration via Low Noise Assumptions
Embedding into the Unit Cube and Permutahedron under Low-Noise
Unit Cube
Permutahedron
Elicitation in Low Dimensions with Multiple Problem Instances
Discussion and Conclusion
Notation tables
...and 7 more sections

Key Result

Theorem 1

Let $\ell :\mathcal{Y}\times \mathcal{Y} \to \mathbb{R}_{+}$ and $\mathcal{P}\subseteq \Delta_{\mathcal{Y}}$. Let $\Gamma :\mathcal{P}\rightrightarrows \mathbb{R}^{d}$ and $\psi :\mathbb{R}^{d}\to \mathcal{Y}$ be such that $\Gamma$ is elicitable and $(\Gamma ,\psi )$ is an $\ell$-calibrated property

Figures (3)

Figure 1: (Left) Mode level sets of $\Delta_{\mathcal{Y}}$ where $\mathcal{Y} =\{a,b,c,d\}$ is embedded into a two-dimensional unit cube. The red point at the center marks the origin $(0,0)$, which is the hallucination region. (Right) An embedding of $\Delta_{\mathcal{Y}}$ where $\mathcal{Y} =\{a,b,c,d,e,f\}$ into a three-dimensional permutahedron: the beige regions are strict calibration regions, the light pink regions are regions of inconsistency, and the auburn region is the region with hallucinations. For example, consider the report $u = \vec{0}$. Since losses are convex, if $p = (0, \frac{1}{2}, 0, 0, \frac{1}{2}, 0)$, then $\mathrm{conv}\,(\{b,e\})$ (dashed grey) is optimal, which includes $u$. However, $\vec{0}$ is also contained in $\mathrm{conv}\,(\{a,d\})$, which is optimal for the distribution $p' = (\frac{1}{2}, 0, 0, \frac{1}{2}, 0, 0)$. Therefore, we cannot distinguish the optimal reports for a hallucination at $\vec{0}$.
Figure 2: (Left) Corners represent the strict calibration regions for $\Theta_{\alpha}$ where $\mathcal{Y} =\{a,b,c,d \}$ is embedded into a two-dimensional unit cube with $\alpha = 0.25$. (Right) Auburn regions show where strict calibration holds for $\Theta_{\alpha}$ where $\mathcal{Y} =\{a,b,c,d,e,f\}$ is embedded into a three-dimensional permutahedron with $\alpha =\frac{1}{3}-\epsilon$.
Figure 3: Four outcomes embedded in $\mathbb{R}^2$ in two different ways, with the minimizing reports $\bullet$ for a distribution $p$. (Left) Configuration $\varphi_{1}$ with $\bullet$ at $(-0.5, 0.3)$, implying $p_a>p_d$ and $p_b>p_c$. (Right) Configuration $\varphi_{2}$ with $\bullet$ at $(0,0)$, implying $p_a=p_b$ and $p_c=p_d$. This implies the true distribution is $p = (0.4,0.4,0.1,0.1)$.
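The idea behind Figure 3, combining the minimizing reports from two configurations of the same four outcomes, can be sketched as follows. This is a hedged illustration and not the paper's algorithm: the coordinates in `PHI_1` and `PHI_2` are hypothetical antipodal layouts chosen so that each axis compares one pair of outcomes, the distribution `p_true` is made up for the example, and the function names are assumptions. Under square loss, each minimizing report is the mean of the embedded outcomes, so each coordinate reveals one probability difference.

```python
# Hypothetical layouts (assumptions): each axis of R^2 holds one antipodal pair.
PHI_1 = {"a": (-1.0, 0.0), "d": (1.0, 0.0), "b": (0.0, 1.0), "c": (0.0, -1.0)}
PHI_2 = {"a": (-1.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0), "d": (0.0, -1.0)}

def minimizing_report(p, phi):
    """Square-loss minimizer, i.e. the mean of the embedded outcomes under p."""
    return tuple(sum(p[y] * phi[y][i] for y in p) for i in range(2))

def recover_distribution(r1, r2):
    """Invert the two reports using the pairings above and sum(p) == 1."""
    s = -r1[0]  # p_a - p_d, from the first axis of PHI_1
    u = -r2[0]  # p_a - p_b, from the first axis of PHI_2
    v = r2[1]   # p_c - p_d, from the second axis of PHI_2
    # r1[1] = p_b - p_c is redundant given s, u, and v, so it is not used.
    p_a = (1.0 + u + 2.0 * s - v) / 4.0
    p_d = p_a - s
    return {"a": p_a, "b": p_a - u, "c": p_d + v, "d": p_d}

# Made-up distribution with a unique mode at outcome "a".
p_true = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

r1 = minimizing_report(p_true, PHI_1)  # signs give p_a > p_d and p_b > p_c
r2 = minimizing_report(p_true, PHI_2)  # signs give p_a > p_b and p_c > p_d

est = recover_distribution(r1, r2)
mode = max(est, key=est.get)
print(r1, r2, mode)
```

Neither two-dimensional report identifies the mode on its own, since each compares only two pairs of outcomes. Together with normalization, the two reports determine all four probabilities in this layout, which is the kind of trade between dimension and number of problem instances the paper studies.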

Theorems & Definitions (44)

Definition 1: Property, Elicits, Level Set
Definition 2: $\ell$-Calibrated Loss
Definition 3: $\ell$-Calibrated Property
Theorem 1: agarwal2015consistent
Definition 4: 0-1 Loss
Definition 5: Square Loss
Definition 6: $(L^2 ,\varphi )$ Induced Loss
Proposition 1
Definition 7: MAP Link
Definition 8: Hallucination
...and 34 more
