Reframing Long-Tailed Learning via Loss Landscape Geometry

Shenghan Chen, Yiming Liu, Yanzhen Wang, Yujia Wang, Xiankai Lu

Abstract

Balancing the performance trade-off on long-tailed (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" (the model tends to severely overfit on head classes while quickly forgetting tail classes) and propose a solution from a loss landscape perspective. We observe that different classes have divergent convergence points in the loss landscape. Moreover, this divergence is aggravated when the model settles into sharp, non-robust minima rather than a shared, flat solution that benefits all classes. In light of this, we propose a continual-learning-inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module memorizes group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module that seeks flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, which facilitates broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at: https://gkp-gsa.github.io/.
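To make the idea of seeking flatter minima concrete, the following is a minimal sketch of a generic sharpness-aware minimization (SAM) update on a toy quadratic loss. It is not the paper's Grouped Sharpness Aware module: the function name `sam_step`, the toy Hessian `H`, and the values of `lr` and `rho` are all assumptions chosen for illustration. In a grouped setting, `grad_fn` could presumably be the gradient of one group's loss.

```python
import numpy as np

def sam_step(theta, grad_fn, lr=0.05, rho=0.05):
    """One generic SAM update (illustrative; not the paper's GSA module)."""
    g = grad_fn(theta)
    # Ascent step: move to the approximate worst-case point within an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: apply the gradient evaluated at the perturbed point.
    g_perturbed = grad_fn(theta + eps)
    return theta - lr * g_perturbed

# Toy quadratic loss with one sharp direction (curvature 10) and one flat direction (0.1).
H = np.diag([10.0, 0.1])

def loss(theta):
    return 0.5 * theta @ H @ theta

def grad(theta):
    return H @ theta

theta0 = np.array([1.0, 1.0])
theta = theta0.copy()
for _ in range(50):
    theta = sam_step(theta, grad)
```

Because the descent gradient is taken at the perturbed point, the update penalizes directions in which the loss rises quickly within the `rho`-ball, which is the usual motivation for SAM-style methods.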

Paper Structure (32 sections, 37 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: "Tail performance degradation" from the loss landscape view. Starting from a randomly initialized point $\theta(t_0)$ of the LT model, (a) training only on tail classes converges to $\theta(t_1)$ in a flat region, while (b) standard training on the long-tailed dataset converges to $\theta(t_2)$ in a sharp region. The optimization trajectory settles at $\theta(t_2)$, which causes tail performance degradation by diverging from the tail convergence point $\theta(t_1)$. In contrast, our optimization (red line) steers the model towards a solution $\theta^*$ that remains closer to the tail-class minimum $\theta(t_1)$ and resides in a flatter region. ($\theta_1$ and $\theta_2$ on the axes denote the projection directions used for 2D visualization [li2018visualizing].)
  • Figure 2: (a) Standard training results in a sharp loss landscape for tail classes, where the corresponding feature quality peaks and then declines. (b) In contrast, our method flattens the landscape and preserves high feature quality for both head and tail classes.
  • Figure 3: Our framework consists of two key components: (a) the Grouped Sharpness Aware (GSA) module, which minimizes group-specific sharpness to find flat minima, and (b) the Grouped Knowledge Preservation (GKP) module, which prevents tail performance degradation by preserving the optimal parameters of the other groups.
  • Figure 4: Investigation of the perturbation direction. We decompose the original gradient $\nabla_\theta \mathcal{L}_{\mathcal{D}_g}(\theta)$ (green) into two components: the head-dominated global gradient $\nabla_\theta \mathcal{L}_{\mathcal{D}}(\theta)$ (red) and the beneficial, group-specific gradient $\hat{\nabla}_\theta \mathcal{L}_{\mathcal{D}_g}(\theta)$ (blue).
  • Figure 5: Ablation study on the number of groups used by our method. The shaded area indicates the fluctuation in accuracy.
  • ...and 4 more figures
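
The gradient decomposition described in the Figure 4 caption can be illustrated with a small sketch. The captions do not state how the decomposition is computed, so this example assumes an orthogonal projection: the component of the group gradient that lies along the global gradient is removed, and the remainder is treated as the group-specific part. The function name `decompose` and the example vectors are hypothetical.

```python
import numpy as np

def decompose(group_grad, global_grad, eps=1e-12):
    """Split group_grad into a part along global_grad and an orthogonal remainder.

    Assumption: orthogonal projection is used here for illustration only;
    the source does not specify the decomposition rule.
    """
    coef = (group_grad @ global_grad) / (global_grad @ global_grad + eps)
    along_global = coef * global_grad            # head-dominated component (assumed)
    group_specific = group_grad - along_global   # remainder, orthogonal to global_grad
    return along_global, group_specific

# Hypothetical 2D gradients standing in for the group and global gradients.
g_group = np.array([3.0, 1.0])
g_global = np.array([1.0, 0.0])
along, specific = decompose(g_group, g_global)
```

Under this assumption the two parts sum back to the original group gradient, and the group-specific part carries no component in the global direction.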