Minimax Generalized Cross-Entropy

Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez, Anqi Liu

Abstract

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Generalized cross-entropy (GCE), which interpolates between the CE and MAE losses, has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performance on complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradients computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.

Paper Structure (28 sections, 4 theorems, 64 equations, 6 figures, 3 tables, 1 algorithm)

Key Result

Theorem 1

Given a loss function $\ell_\beta$, if $\mathrm{h}_\beta$ is the minimax classifier in eq:soft_clf_mrc, the worst-case distribution $\mathrm{p}_\beta \in \arg \underset{\mathrm{p} \in \mathcal{U}}{\max} \ \ell_\beta(\mathrm{h}_\beta, \mathrm{p})$ is given by ... Reciprocally, if $\mathrm{p}_\beta$ is the worst-case distribution corresponding to the minimax problem in eq:minmaxrisk, that is, ...

Figures (6)

  • Figure 1: Relation between $\beta$ and the resulting loss function. For $\beta =1$, the loss corresponds to the MAE while for $\beta=\infty$, it corresponds to CE. For $\beta \in(1,\infty)$, the loss interpolates between the MAE and CE.
  • Figure 2: Relation between the minimax classifier $\mathrm{h}_\beta(x)_y$ and the worst-case probability $\mathrm{p}_\beta(y|x)$ for 2 classes. For $\beta \in (1,\infty)$, the worst-case probabilities take a cautious stance, avoiding the extremes of the MAE ($\beta=1$) and CE ($\beta=\infty$) losses.
  • Figure 3: Average test accuracy under clean training data for multiple complex datasets. The loss parameter $\beta$ is set to 1.4. The figure shows the fast convergence of the proposed MGCE in comparison to GCE.
  • Figure 4: Top-1 validation accuracy on the real-world noisy dataset WebVision. The figure shows that the proposed MGCE outperforms GCE, which significantly underfits on this complex dataset due to its non-convexity.
  • Figure 5: Top-1 test accuracy on the real-world noisy dataset Clothing-1M. The figure shows that the proposed MGCE outperforms GCE, which underfits on this complex dataset of 1 million training samples with noisy labels.
  • ...and 1 more figure
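
The interpolation described in the caption of Figure 1 can be illustrated numerically. The sketch below is not the paper's definition of $\ell_\beta$, which this page does not show. It assumes the standard Box-Cox form of GCE, $(1 - p^q)/q$, together with the hypothetical mapping $q = 1/\beta$, chosen only because it reproduces the two endpoints named in the caption: $1 - p$ (the MAE up to a constant factor) at $\beta = 1$ and $-\log p$ (CE) as $\beta \to \infty$. The function name `gce_loss` is likewise illustrative.

```python
import math


def gce_loss(p_true: float, beta: float) -> float:
    """Box-Cox-style generalized cross-entropy for one sample.

    ASSUMPTION: uses (1 - p^q) / q with q = 1 / beta. This mapping is a
    guess that matches the endpoints in Figure 1; it is not taken from
    the paper.

    p_true: probability the classifier assigns to the true label.
    beta:   loss parameter in [1, inf].
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    if beta < 1.0:
        raise ValueError("beta must be at least 1")
    if math.isinf(beta):
        # Limit q -> 0 of (1 - p^q) / q is the cross-entropy -log(p).
        return -math.log(p_true)
    q = 1.0 / beta
    return (1.0 - p_true ** q) / q


if __name__ == "__main__":
    p = 0.3  # a poorly classified sample
    for beta in (1.0, 1.4, 10.0, math.inf):
        print(f"beta={beta}: loss={gce_loss(p, beta):.4f}")
```

Under this assumed form, the loss on a misclassified sample grows monotonically with $\beta$: it is bounded by 1 at $\beta = 1$ and unbounded at $\beta = \infty$, which matches the robustness versus optimization trade-off described in the abstract.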

Theorems & Definitions (9)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Remark 1
  • Theorem 3
  • proof
  • Corollary 1
  • proof