Trading off Consistency and Dimensionality of Convex Surrogates for the Mode

Enrique Nueve, Bo Waggoner, Dhamma Kimpara, Jessie Finocchiaro

TL;DR

This work revisits the classic multiclass surrogate-design problem by relaxing the requirement of consistency over the whole simplex, which needs surrogate dimension $d\ge n-1$, and studying partial consistency through polytope embeddings. By embedding outcomes into the vertices of low-dimensional polytopes and using a square-loss surrogate with a MAP link, the authors show that hallucination regions inevitably arise when $d<n-1$, but that calibration regions exist under structured low-noise assumptions, with concrete instantiations such as $n=2^{d}$ outcomes in the $d$-dimensional unit cube and $n=d!$ outcomes in the $d$-dimensional permutahedron. They further show that using multiple problem instances allows the mode to be elicited over the whole simplex with $\frac{n}{2}$ dimensions, offering a scalable alternative to full $(n-1)$-dimensional surrogates. These insights provide practical guidance for selecting embeddings and problem-instance strategies in large-scale multiclass or structured prediction tasks, with implications for calibration guarantees and parallelizable optimization.
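As a rough illustration of the hallucination phenomenon, the following sketch embeds four outcomes on the corners of a square and applies a nearest-vertex link to the square-loss minimizer. The corner coordinates, the placement of outcomes, and the helper names `square_loss_minimizer` and `map_link_candidates` are assumptions made for this example; the paper's exact definitions of the induced loss and the MAP link may differ in details such as tie-breaking.

```python
import numpy as np

# Hypothetical embedding: four outcomes on the corners of the square [-1, 1]^2,
# with a/c and b/d placed on opposite corners (an assumption for illustration).
OUTCOMES = ["a", "b", "c", "d"]
VERTICES = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])


def square_loss_minimizer(p):
    """Report minimizing E_{y~p} ||u - phi(y)||^2, i.e. the mean embedded point."""
    return np.asarray(p) @ VERTICES


def map_link_candidates(u):
    """All outcomes whose vertex is nearest to u (ties are returned together)."""
    dists = np.linalg.norm(VERTICES - u, axis=1)
    return [y for y, dist in zip(OUTCOMES, dists) if np.isclose(dist, dists.min())]


p = [0.5, 0.0, 0.5, 0.0]            # mass only on a and c
u = square_loss_minimizer(p)         # lands on the center of the square
candidates = map_link_candidates(u)  # every corner is equally close to the center
hallucinated = [y for y in candidates if p[OUTCOMES.index(y)] == 0.0]
print(u, candidates, hallucinated)
```

In this toy setup the report sits at the center, so any tie-breaking rule that picks `b` or `d` returns an outcome with zero probability.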

Abstract

In multiclass classification over $n$ outcomes, the outcomes must be embedded into the reals with dimension at least $n-1$ in order to design a consistent surrogate loss that leads to the "correct" classification, regardless of the data distribution. For large $n$, such as in information retrieval and structured prediction tasks, optimizing a surrogate in $n-1$ dimensions is often intractable. We investigate ways to trade off surrogate loss dimension, the number of problem instances, and restricting the region of consistency in the simplex for multiclass classification. Following past work, we examine an intuitive embedding procedure that maps outcomes into the vertices of convex polytopes in a low-dimensional surrogate space. We show that full-dimensional subsets of the simplex exist around each point mass distribution for which consistency holds, but also that, with fewer than $n-1$ dimensions, there exist distributions for which a phenomenon called hallucination occurs: the optimal report under the surrogate loss is an outcome with zero probability. Looking towards application, we derive a result to check if consistency holds under a given polytope embedding and low-noise assumption, providing insight into when to use a particular embedding. We provide examples of embedding $n = 2^{d}$ outcomes into the $d$-dimensional unit cube and $n = d!$ outcomes into the $d$-dimensional permutahedron under low-noise assumptions. Finally, we demonstrate that with multiple problem instances, we can learn the mode with $\frac{n}{2}$ dimensions over the whole simplex.
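The low-noise examples can be probed numerically with a sketch like the one below. It is an empirical sanity check, not the paper's result: it assumes the low-noise set consists of distributions with $\max_y p_y \ge 1-\alpha$, uses a cube with vertices in $\{-1,1\}^d$ and a permutahedron centered at the origin, and applies a nearest-vertex link to the mean embedded point. All function names are hypothetical.

```python
import itertools

import numpy as np


def cube_vertices(d):
    """All 2^d sign vectors: vertices of the cube [-1, 1]^d."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))


def permutahedron_vertices(d):
    """All d! permutations of (1, ..., d), centered at the origin."""
    perms = np.array(list(itertools.permutations(range(1, d + 1))), dtype=float)
    return perms - perms.mean(axis=0)


def nearest_vertex(vertices, u):
    """Index of the vertex closest to the report u."""
    return int(np.argmin(np.linalg.norm(vertices - u, axis=1)))


def low_noise_check(vertices, alpha, trials=200, seed=0):
    """Sample p with max_y p_y >= 1 - alpha; test that the link returns the mode."""
    rng = np.random.default_rng(seed)
    n = len(vertices)
    for y in range(n):
        for _ in range(trials):
            # Put mass 1 - alpha on outcome y and spread alpha at random.
            p = alpha * rng.dirichlet(np.ones(n))
            p[y] += 1.0 - alpha
            # The square-loss minimizer is the mean embedded point p @ vertices.
            if nearest_vertex(vertices, p @ vertices) != y:
                return False
    return True


print(low_noise_check(cube_vertices(3), alpha=0.25))
print(low_noise_check(permutahedron_vertices(3), alpha=1 / 3 - 0.01))
```

Random sampling can only fail to find a counterexample; it does not certify calibration over the whole low-noise set.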

Paper Structure

This paper contains 22 sections, 20 theorems, 20 equations, 3 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Let $\ell :\mathcal{Y}\times \mathcal{Y} \to \mathbb{R}_{+}$ and $\mathcal{P}\subseteq \Delta_{\mathcal{Y}}$. Let $\Gamma :\mathcal{P}\rightrightarrows \mathbb{R}^{d}$ and $\psi :\mathbb{R}^{d}\to \mathcal{Y}$ be such that $\Gamma$ is elicitable and $(\Gamma ,\psi )$ is an $\ell$-calibrated property

Figures (3)

  • Figure 1: (Left) Mode level sets of $\Delta_{\mathcal{Y}}$ where $\mathcal{Y} =\{a,b,c,d\}$ is embedded into a two-dimensional unit cube. The red point at the center marks the origin $(0,0)$, which is the hallucination region. (Right) An embedding of $\Delta_{\mathcal{Y}}$, where $\mathcal{Y} =\{a,b,c,d,e,f\}$, into a three-dimensional permutahedron: the beige regions are strict calibration regions, the light pink regions are regions of inconsistency, and the auburn region is the region of hallucinations. For example, consider the report $u = \vec{0}$. Since losses are convex, if $p = (0, \frac{1}{2}, 0, 0, \frac{1}{2}, 0)$, then $\mathrm{conv}\,(\{b,e\})$ (dashed grey) is optimal, which includes $u$. However, $\vec{0}$ is also contained in $\mathrm{conv}\,(\{a,d\})$, which is optimal for the distribution $p' = (\frac{1}{2}, 0, 0, \frac{1}{2}, 0, 0)$. Therefore, the report $\vec{0}$ cannot distinguish between these distributions, and a hallucination can occur there.
  • Figure 2: (Left) Corners represent the strict calibration regions for $\Theta_{\alpha}$ where $\mathcal{Y} =\{a,b,c,d \}$ is embedded into a two-dimensional unit cube with $\alpha = 0.25$. (Right) Auburn regions show where strict calibration holds for $\Theta_{\alpha}$ where $\mathcal{Y} =\{a,b,c,d,e,f\}$ is embedded into a three-dimensional permutahedron with $\alpha =\frac{1}{3}-\epsilon$.
  • Figure 3: Four outcomes embedded in $\mathbb{R}^2$ in two different ways, with the minimizing reports $\bullet$ for a distribution $p$. (Left) Configuration $\varphi_{1}$ with $\bullet$ at $(-0.5,0.3)$, implying $p_a>p_d$ and $p_b>p_c$. (Right) Configuration $\varphi_{2}$ with $\bullet$ at $(0,0)$, implying $p_a=p_b$ and $p_c=p_d$. This implies the true distribution is $p = (0.4,0.4,0.1,0.1)$.
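
The idea behind Figure 3, that reports from several low-dimensional embeddings jointly carry more information than any single one, can be illustrated with a simplified linear-algebra sketch. This is not the paper's algorithm: the two corner placements `PHI_1` and `PHI_2`, the distribution `p_true`, and the helper `recover_distribution` are hypothetical, and the coordinates do not match those in the figure. The sketch only shows that two 2-dimensional square-loss reports, together with normalization, can determine a distribution over four outcomes.

```python
import numpy as np

# Two hypothetical placements of outcomes (a, b, c, d) on the corners of [-1, 1]^2.
# Columns are embedded points; PHI_2 swaps the corners assigned to b and c.
PHI_1 = np.array([[1.0, 1.0, -1.0, -1.0],
                  [1.0, -1.0, -1.0, 1.0]])
PHI_2 = np.array([[1.0, -1.0, 1.0, -1.0],
                  [1.0, -1.0, -1.0, 1.0]])


def recover_distribution(u1, u2):
    """Solve the stacked system [PHI_1; PHI_2; 1^T] p = [u1; u2; 1] for p."""
    A = np.vstack([PHI_1, PHI_2, np.ones((1, 4))])
    b = np.concatenate([u1, u2, [1.0]])
    p_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p_hat


p_true = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical, only used to simulate reports
u1, u2 = PHI_1 @ p_true, PHI_2 @ p_true  # square-loss minimizers in each instance
p_hat = recover_distribution(u1, u2)
mode = "abcd"[int(np.argmax(p_hat))]
print(p_hat, mode)
```

Here each instance uses $d = 2 = \frac{n}{2}$ dimensions; the stacked system has full column rank for this particular pair of placements, which is what makes the recovery possible.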

Theorems & Definitions (44)

  • Definition 1: Property, Elicits, Level Set
  • Definition 2: $\ell$-Calibrated Loss
  • Definition 3: $\ell$-Calibrated Property
  • Theorem 1: agarwal2015consistent
  • Definition 4: 0-1 Loss
  • Definition 5: Square Loss
  • Definition 6: $(L^2 ,\varphi )$ Induced Loss
  • Proposition 1
  • Definition 7: MAP Link
  • Definition 8: Hallucination
  • ...and 34 more