MaxSup: Overcoming Representation Collapse in Label Smoothing
Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper
TL;DR
This work identifies two hidden flaws of Label Smoothing (LS): overconfidence on misclassified samples and aggressive intra-class compression of features. It decomposes LS at the logit level into a regularization term and an error-amplification term, then introduces Max Suppression (MaxSup), which penalizes the top-1 logit instead of the ground-truth logit and thereby regularizes correct and incorrect predictions uniformly. Theoretical analysis and extensive experiments on ImageNet, CIFAR, and ADE20K across various architectures show that MaxSup preserves intra-class variation, sharpens inter-class boundaries, and improves transferability and downstream performance with minimal computational overhead. The results suggest that MaxSup is a robust, easily integrated alternative to LS for improving generalization and representation quality in deep networks.
Abstract
Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon has remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassification. The latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
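The difference between the two penalties can be illustrated with a minimal pure-Python sketch. It relies on the standard identity that cross-entropy against smoothed labels (1 - alpha) * one_hot + alpha / K equals ordinary cross-entropy plus alpha * (z_gt - mean(z)), and it assumes that MaxSup simply replaces the ground-truth logit z_gt with the largest logit, as the abstract describes. The function names, the alpha = 0.1 default, and the example logits are illustrative choices, not taken from the released code.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(z))) over a list of logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def cross_entropy(logits, target):
    """Standard cross-entropy with a one-hot target."""
    return log_sum_exp(logits) - logits[target]


def smoothed_ce(logits, target, alpha=0.1):
    """Cross-entropy against label-smoothed targets (textbook LS form)."""
    k = len(logits)
    lse = log_sum_exp(logits)
    loss = 0.0
    for i, z in enumerate(logits):
        soft = (1.0 - alpha) * (1.0 if i == target else 0.0) + alpha / k
        loss -= soft * (z - lse)
    return loss


def ls_penalty(logits, target, alpha=0.1):
    """Logit-level LS term: penalizes the ground-truth logit."""
    mean = sum(logits) / len(logits)
    return alpha * (logits[target] - mean)


def maxsup_penalty(logits, alpha=0.1):
    """Assumed MaxSup term: penalizes the top-1 logit, whatever its class."""
    mean = sum(logits) / len(logits)
    return alpha * (max(logits) - mean)


def ls_loss(logits, target, alpha=0.1):
    return cross_entropy(logits, target) + ls_penalty(logits, target, alpha)


def maxsup_loss(logits, target, alpha=0.1):
    return cross_entropy(logits, target) + maxsup_penalty(logits, alpha)


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]  # class 0 is the top-1 prediction
    for target in range(len(logits)):
        print(
            f"target={target} "
            f"LS penalty={ls_penalty(logits, target):+.3f} "
            f"MaxSup penalty={maxsup_penalty(logits):+.3f}"
        )
```

When the prediction is correct (target 0 here), the two penalties coincide. When it is wrong, the LS penalty depends on the ground-truth logit and can turn negative, so minimizing it rewards pushing that logit further below the others. The MaxSup penalty is never negative and always acts on the logit the network is most confident about.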
