MaxSup: Overcoming Representation Collapse in Label Smoothing

Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper

TL;DR

This work identifies two hidden flaws of Label Smoothing: overconfidence on misclassified samples and aggressive intra-class compression of features. It decomposes LS at the logit level into a regularization term and an error-amplification term, then introduces Max Suppression (MaxSup), which penalizes the top-1 logit instead of the ground-truth logit, providing uniform regularization for all predictions. Theoretical analysis and extensive experiments across ImageNet, CIFAR, ADE20K, and various architectures show that MaxSup preserves intra-class variation, sharpens inter-class boundaries, and improves transferability and downstream performance with minimal computational overhead. The results suggest MaxSup as a robust, easily integrable alternative to LS for improving generalization and representation quality in deep networks.

Abstract

Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
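The contrast between the two regularizers can be illustrated with a minimal pure-Python sketch. It is not the authors' released implementation. It relies on the fact that, for softmax outputs, the extra term LS adds to the hard-label cross-entropy equals alpha * (z_gt - mean(z)) at the logit level. The MaxSup form used here, alpha * (max(z) - mean(z)), is an assumption obtained by swapping the ground-truth logit for the top-1 logit as the abstract describes. All function names and the example values are hypothetical.

```python
import math


def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def cross_entropy(target, z):
    """H(target, softmax(z)) for a probability vector `target`."""
    return -sum(t * lq for t, lq in zip(target, log_softmax(z)))


def smooth_labels(gt, num_classes, alpha):
    """Standard label smoothing: (1 - alpha) * one_hot + alpha / K."""
    return [(1.0 - alpha) * (1.0 if k == gt else 0.0) + alpha / num_classes
            for k in range(num_classes)]


def ls_regularizer(z, gt, alpha):
    """Logit-level LS term: alpha * (z_gt - mean(z)).

    Minimizing it lowers the ground-truth logit and raises all others,
    even when the ground-truth logit is not the largest one.
    """
    return alpha * (z[gt] - sum(z) / len(z))


def maxsup_regularizer(z, alpha):
    """Assumed MaxSup term: alpha * (max(z) - mean(z)).

    It penalizes the top-1 logit whether or not the prediction is correct.
    """
    return alpha * (max(z) - sum(z) / len(z))


if __name__ == "__main__":
    z, gt, alpha = [3.0, 0.0, 0.0], 1, 0.1  # misclassified: top-1 is class 0
    hard = [1.0 if k == gt else 0.0 for k in range(len(z))]
    soft = smooth_labels(gt, len(z), alpha)
    print("CE with smoothed labels:", cross_entropy(soft, z))
    print("CE + LS term:           ", cross_entropy(hard, z) + ls_regularizer(z, gt, alpha))
    print("LS term:                ", ls_regularizer(z, gt, alpha))
    print("MaxSup term (assumed):  ", maxsup_regularizer(z, alpha))
```

In the misclassified example, the LS term is negative and its gradient pushes the already-low ground-truth logit further down, whereas the assumed MaxSup term is positive and acts on the wrongly dominant logit. When the ground-truth class is also the top-1 prediction, the two terms coincide.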

Paper Structure

This paper contains 35 sections, 3 theorems, 37 equations, 5 figures, 16 tables, 1 algorithm.

Key Result

Lemma 3.2

Decomposition of Cross-Entropy Loss with Soft Labels. Here, $\mathbf{q}$ is the predicted probability vector, $H(\cdot)$ denotes the cross-entropy, and $\frac{\mathbf{1}}{K}$ is the uniform distribution introduced by LS. This shows that LS adds a regularization term, $L_{\textit{LS}}$, which smooths the output distribution and helps to reduce overfitting ...
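The lemma's displayed equation is not reproduced in this summary. The following is a sketch of how such a decomposition can be derived under standard label-smoothing notation. The symbols $\mathbf{y}$ (one-hot label), $\mathbf{s}$ (smoothed label), $\alpha$ (smoothing weight), $\mathbf{z}$ (logits), and $gt$ (ground-truth index) are assumptions introduced here and may differ from the paper's own notation.

```latex
% Assumed notation (not defined in the text above):
%   y : one-hot ground-truth label, gt = index of its nonzero entry
%   s = (1 - \alpha) y + \alpha \mathbf{1}/K : smoothed label, \alpha \in [0, 1]
%   z : logits, with q = softmax(z)
\begin{align*}
H(\mathbf{s}, \mathbf{q})
  &= (1-\alpha)\, H(\mathbf{y}, \mathbf{q})
     + \alpha\, H\!\left(\tfrac{\mathbf{1}}{K}, \mathbf{q}\right) \\
  &= H(\mathbf{y}, \mathbf{q})
     + \underbrace{\alpha \left( H\!\left(\tfrac{\mathbf{1}}{K}, \mathbf{q}\right)
       - H(\mathbf{y}, \mathbf{q}) \right)}_{L_{\textit{LS}}} \\
L_{\textit{LS}}
  &= \alpha \left( z_{gt} - \frac{1}{K} \sum_{k=1}^{K} z_k \right)
\end{align*}
```

The first line uses linearity of cross-entropy in its first argument. The last line follows from $\log q_k = z_k - \log \sum_j e^{z_j}$, since the log-sum-exp terms cancel between the two cross-entropies. In this form, $L_{\textit{LS}}$ always acts on the ground-truth logit $z_{gt}$, including when another class has the largest logit.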

Figures (5)

  • Figure 1: Comparison of Label Smoothing (LS) and MaxSup. Left: MaxSup mitigates the intra-class compression induced by LS while preserving inter-class separability. Right: Grad-CAM visualizations show that MaxSup more effectively highlights class-discriminative regions than LS.
  • Figure 2: Grad-CAM (Selvaraju et al., 2019) visualizations for DeiT-Small models under three training setups: MaxSup (2nd row), Label Smoothing (3rd row), and a baseline (4th row). The first row shows the original images. Compared to Label Smoothing, MaxSup more effectively filters out non-target regions and highlights essential features of the target class, reducing instances where the model partially or entirely focuses on irrelevant areas.
  • Figure 3: Visualization of penultimate-layer activations from DeiT-Small (trained with CutMix and Mixup) on the ImageNet validation set. The top row shows embeddings for a MaxSup-trained model, and the bottom row shows embeddings for a Label Smoothing (LS)–trained model. In each subfigure, classes are either semantically similar or confusingly labeled. Compared to LS, MaxSup yields more pronounced inter-class separability and richer intra-class diversity, suggesting stronger representation and classification performance.
  • Figure 4: Visualization of the penultimate-layer activations for DeiT-Small (trained with CutMix and Mixup) on selected ImageNet classes. The top row shows results for a MaxSup-trained model; the bottom row shows Label Smoothing (LS). In (a,b), the model must distinguish semantically similar classes (e.g., Saluki vs. Grey Fox; Tow Truck vs. Pickup), while (c,d) involve confusing categories (e.g., Jean vs. Shoe Shop, Stinkhorn vs. related objects). Compared to LS, MaxSup yields both improved inter-class separability and richer intra-class variation, indicating more robust representation learning.
  • Figure 5: Comparison of logit distributions under different regularizers.

Theorems & Definitions (7)

  • Definition 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Definition 3.5
  • proof
  • proof