Long-tail learning via logit adjustment

Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv Kumar

TL;DR

This work addresses long-tailed label distributions by adjusting the logits of a softmax classifier according to the class priors. It ties such adjustments to the Bayes-optimal decision rule under the balanced error (BER), the average of the per-class error rates, and offers two practical realizations: a post-hoc translation of a trained model's logits, and a logit-adjusted loss that incorporates the class priors during training. The approach unifies and generalises prior post-hoc and margin-based methods, is shown to be Fisher consistent for the BER, and yields empirical gains over standard training on synthetic and real-world long-tailed datasets, especially on rare classes.
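To make the balanced error concrete, the following minimal sketch computes it as the unweighted mean of per-class error rates and contrasts it with the ordinary error rate. The function name `balanced_error` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def balanced_error(y_true, y_pred):
    """Mean of per-class error rates, over the classes present in y_true."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            errors[t] += 1
    return sum(errors[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy data: nine samples of class 0, one sample of class 1,
# and a classifier that always predicts the dominant class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
ber = balanced_error(y_true, y_pred)
print(plain_error, ber)
```

On this toy data the ordinary error is low because the rare class contributes only one sample, while the balanced error weights both classes equally and so exposes the failure on the rare class.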

Abstract

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naïve learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance.
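The two modifications described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the adjustment subtracts $\tau \log \pi_y$ from each logit at prediction time (post-hoc) and adds $\tau \log \pi_y$ to each logit inside the softmax cross-entropy (training loss), where $\pi_y$ is the empirical class prior and $\tau$ is a scaling parameter. The helper names below are hypothetical, and every class is assumed to have a strictly positive prior.

```python
import math


def class_priors(labels, num_classes):
    """Empirical label frequencies; assumes every class occurs at least once."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]


def posthoc_adjusted_prediction(logits, priors, tau=1.0):
    """Predict argmax_y of logits[y] - tau * log(priors[y])."""
    adjusted = [f - tau * math.log(p) for f, p in zip(logits, priors)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


def logit_adjusted_loss(logits, label, priors, tau=1.0):
    """Softmax cross-entropy on logits shifted by + tau * log(priors[y])."""
    shifted = [f + tau * math.log(p) for f, p in zip(logits, priors)]
    m = max(shifted)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(s - m) for s in shifted))
    return log_norm - shifted[label]


# Hypothetical example: class 0 is dominant, class 1 is rare.
priors = class_priors([0] * 9 + [1], num_classes=2)
logits = [1.0, 0.5]

plain_pred = posthoc_adjusted_prediction(logits, priors, tau=0.0)
adjusted_pred = posthoc_adjusted_prediction(logits, priors, tau=1.0)
print(plain_pred, adjusted_pred)
```

Setting `tau=0.0` recovers the standard argmax and the standard softmax cross-entropy, so the sketch can be compared directly against an unadjusted baseline. With `tau=1.0`, the loss on a rare-class sample is larger than the unadjusted loss, which is one way to see how the adjustment pushes for a larger margin on rare labels.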

Paper Structure

This paper contains 26 sections, 3 theorems, 45 equations, 11 figures, 5 tables.

Key Result

Theorem 1

For any $\delta \in \mathbb{R}_+^L$, the pairwise loss in the unified margin loss equation is Fisher consistent with weights and margins

Figures (11)

  • Figure 1: Mean and standard deviation over $5$ runs of per-class weight norms for a ResNet-32 under momentum and Adam optimisers. We use long-tailed ("LT") versions of CIFAR-10 and CIFAR-100, and sort classes in descending order of frequency; the first class is 100 times more likely to appear than the last class. Both optimisers yield solutions with comparable balanced error. However, the weight norms have incompatible trends: under momentum, the norms are strongly correlated with class frequency, while with Adam, the norms are anti-correlated or independent of the class frequency. Consequently, weight normalisation under Adam is ineffective for combatting class imbalance.
  • Figure 2: Results on synthetic binary classification problem. Our logit adjusted loss tracks the Bayes-optimal solution and separator (left & middle panel). Post-hoc logit adjustment matches the Bayes performance with suitable scaling (right panel); however, any weight normalisation fails.
  • Figure 3: Comparison of balanced error for post-hoc correction techniques when varying the scaling parameter $\tau$ (cf. the weight normalisation and logit adjustment equations). Post-hoc logit adjustment consistently outperforms weight normalisation.
  • Figure 4: Per-class error rates of loss modification techniques. For (b) and (c), we aggregate the classes into 10 groups. ERM displays a strong bias towards dominant classes (lower indices). Our proposed logit adjusted softmax loss achieves significant gains on rare classes (higher indices).
  • Figure 5: Comparison of link functions for various losses assuming $\pi = 0.2$, with $\gamma = 1$ (left) and $\gamma = 8$ (right). The balanced loss uses $\omega_{y} = \frac{1}{\pi_{y}}$. The unequal margin loss uses $\delta_{y} = \frac{1}{\gamma} \cdot \log \frac{1 - \pi}{\pi}$. The balanced + margin loss uses $\delta_{-1} = \frac{\pi}{1-\pi}$, $\delta_{+1} = 1$, $\omega_{+1} = \frac{1}{\pi}$.
  • ...and 6 more figures
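The contrast between weight normalisation and post-hoc logit adjustment in Figures 1 and 3 can be illustrated with a toy linear classifier. This is a sketch under stated assumptions: weight normalisation is taken to mean dividing each class score by $\|w_y\|_2^{\tau}$, logit adjustment is taken to mean subtracting $\tau \log \pi_y$, and the weights, features, and priors are made-up values chosen so that the weight norms carry no information about class frequency.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_normalised_scores(weights, features, tau=1.0):
    """Score class y as <w_y, features> / ||w_y||_2 ** tau."""
    return [dot(w, features) / (math.sqrt(dot(w, w)) ** tau) for w in weights]


def logit_adjusted_scores(weights, features, priors, tau=1.0):
    """Score class y as <w_y, features> - tau * log(priors[y])."""
    return [dot(w, features) - tau * math.log(p) for w, p in zip(weights, priors)]


# Hypothetical setting: both class weight vectors have the same norm,
# so the norms do not reflect that class 1 is rare.
weights = [[2.0, 0.0], [0.0, 2.0]]
features = [1.0, 0.8]
priors = [0.9, 0.1]

wn = weight_normalised_scores(weights, features)
la = logit_adjusted_scores(weights, features, priors)
print(wn, la)
```

Because the norms are equal here, weight normalisation rescales both scores identically and leaves the dominant class on top, whereas logit adjustment uses the priors directly and shifts the decision towards the rare class.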

Theorems & Definitions (6)

  • Theorem 1
  • Proof of Theorem 1
  • Lemma 2
  • Proof of Lemma 2
  • Lemma 3
  • Proof of Lemma 3