
Long-Tail Learning with Foundation Model: Heavy Fine-Tuning Hurts

Jiang-Xin Shi, Tong Wei, Zhi Zhou, Jie-Jing Shao, Xin-Yan Han, Yu-Feng Li

TL;DR

The paper reveals that heavy fine-tuning of foundation models can deteriorate tail-class performance in long-tail learning. It introduces LIFT, a lightweight, single-stage fine-tuning framework that uses structured lightweight modules, semantic-aware initialization, and test-time ensembling to preserve class-conditional distributions while boosting discriminative power. LIFT achieves competitive or superior results across ImageNet-LT, Places-LT, iNaturalist 2018, and CIFAR-100-LT with far fewer tunable parameters and epochs, often without external data. The work demonstrates rapid convergence (often under 20 epochs) and practical efficiency, providing a robust pathway for deploying foundation-model-based long-tail learners.
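The TL;DR names semantic-aware initialization and test-time ensembling without spelling them out. Below is a minimal NumPy sketch of one plausible reading. It assumes a cosine-similarity classifier whose weight rows start from per-class text embeddings, with random vectors standing in for real foundation-model text features. It also assumes an ensemble that averages logits over several augmented views of one test image. The function names and the `scale` value are hypothetical, and this is not the LIFT implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def init_classifier_from_text(text_features):
    """Hypothetical semantic-aware init: each class's weight row starts as
    that class's unit-normalized (stand-in) text embedding."""
    return l2_normalize(np.asarray(text_features, dtype=float))

def cosine_logits(image_features, weights, scale=25.0):
    """Cosine-similarity classifier; `scale` is a hypothetical temperature."""
    return scale * l2_normalize(image_features) @ weights.T

def tta_predict(view_features, weights):
    """Test-time ensembling sketch: average the logits of several views of
    one image, then return the arg-max class index."""
    logits = np.stack([cosine_logits(v, weights) for v in view_features])
    return int(np.argmax(logits.mean(axis=0)))

rng = np.random.default_rng(0)
num_classes, dim = 5, 16
text_feats = rng.normal(size=(num_classes, dim))  # stand-in text embeddings
W = init_classifier_from_text(text_feats)
# Noisy copies of class 3's embedding play the role of augmented views.
views = [text_feats[3] + 0.1 * rng.normal(size=dim) for _ in range(4)]
print(tta_predict(views, W))
```

Because the classifier starts from class semantics rather than random weights, an image feature already close to its class's text embedding is classified correctly before any training. Averaging over views smooths out the effect of any single augmentation.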

Abstract

The fine-tuning paradigm for long-tail learning tasks has attracted significant interest since the emergence of foundation models. Nonetheless, how fine-tuning affects performance in long-tail learning has not been explicitly quantified. In this paper, we show that heavy fine-tuning can even cause non-negligible performance deterioration on tail classes, and that lightweight fine-tuning is more effective. We attribute this to the inconsistent class-conditional distributions caused by heavy fine-tuning. Building on this observation, we develop LIFT, a low-complexity and accurate long-tail learning algorithm that enables fast prediction and compact models through adaptive lightweight fine-tuning. Experiments verify that, compared with state-of-the-art approaches, LIFT significantly reduces both training time and the number of learned parameters while achieving more accurate predictions. The implementation code is available at https://github.com/shijxcs/LIFT.
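Lightweight fine-tuning keeps the backbone frozen and trains only a small module. As an illustration, consider a residual bottleneck adapter, which is one common structured lightweight module. This section does not say which module LIFT uses. The sketch assumes one adapter in each of 12 transformer blocks of width 768, a hypothetical bottleneck width of 64, and a frozen backbone of roughly 86M parameters (about the size of a ViT-B/16 image encoder). Under those assumptions, the trainable fraction can be counted as follows. The helper names are hypothetical.

```python
import numpy as np

def adapter_param_count(d_model, bottleneck, num_layers):
    """Weights plus biases of a down-projection (d -> r) and an
    up-projection (r -> d), one adapter per transformer block."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return num_layers * (down + up)

def adapter_forward(h, w_down, b_down, w_up, b_up, scale=0.1):
    """Residual bottleneck adapter: h + scale * up(relu(down(h))).
    h comes from the frozen backbone; only w_*/b_* would be trained."""
    z = np.maximum(h @ w_down + b_down, 0.0)
    return h + scale * (z @ w_up + b_up)

# Assumed sizes: width 768, 12 blocks, bottleneck 64, ~86M frozen params.
backbone_params = 86_000_000
extra = adapter_param_count(768, 64, 12)
print(extra, f"{100 * extra / backbone_params:.2f}%")  # 1189632 1.38%
```

If the up-projection is initialized to zeros, `adapter_forward` returns `h` unchanged, so training starts from the pretrained model's behavior. This zero initialization is a common convention, not something stated here about LIFT.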

Paper Structure

This paper contains 51 sections, 1 theorem, 17 equations, 15 figures, 25 tables, and 1 algorithm.

Key Result

Proposition 3.1

An underestimated class-conditional probability $\operatorname{P}(\bm{x}\mid y=j)$ leads to an underestimated loss on class $j$ and a biased prediction towards other classes.
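The mechanism can be illustrated from standard definitions. What follows is a sketch, not the paper's proof. Let $\ell \ge 0$ be any nonnegative loss, and let $\hat{\operatorname{P}}(\bm{x}\mid y=j)$ be the estimate implied by training. Assume it falls below the true density on some region $\mathcal{A}$, while the other classes' conditionals are left unchanged (a simplifying assumption). The loss on class $j$ and the posterior then behave as:

```latex
\begin{aligned}
R_j(f) &= \int \ell\big(f(\bm{x}), j\big)\,\operatorname{P}(\bm{x}\mid y=j)\,d\bm{x}, \\
\int_{\mathcal{A}} \ell\big(f(\bm{x}), j\big)\,\hat{\operatorname{P}}(\bm{x}\mid y=j)\,d\bm{x}
  &\le \int_{\mathcal{A}} \ell\big(f(\bm{x}), j\big)\,\operatorname{P}(\bm{x}\mid y=j)\,d\bm{x}, \\
\hat{\operatorname{P}}(y=j\mid\bm{x})
  &= \frac{\hat{\operatorname{P}}(\bm{x}\mid y=j)\,\operatorname{P}(y=j)}
          {\hat{\operatorname{P}}(\bm{x}\mid y=j)\,\operatorname{P}(y=j)
           + \sum_{k\neq j}\operatorname{P}(\bm{x}\mid y=k)\,\operatorname{P}(y=k)}.
\end{aligned}
```

The second line says the loss that region $\mathcal{A}$ contributes to class $j$ is underestimated. The posterior in the third line is increasing in $\hat{\operatorname{P}}(\bm{x}\mid y=j)$. Lowering that term therefore lowers the posterior of class $j$, and because the shared denominator shrinks while the other numerators stay fixed, every other class's posterior rises. This is the bias towards other classes.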

Figures (15)

  • Figure 1: Comparison with state-of-the-art methods. The x-axis represents the number of learnable parameters and the y-axis shows the test accuracy. The size of each point corresponds to the number of training epochs, with larger points indicating longer training time. Gray labels denote methods that incorporate external data. LIFT consistently achieves higher performance with lower costs and is even comparable with methods that leverage external data.
  • Figure 2: (a-b) On ImageNet-LT and Places-LT, zero-shot CLIP has surpassed many prior methods. By simply introducing an additional classifier, the accuracy further increases. However, the improvements mainly come from the head classes, while the tail classes only achieve marginal enhancements. (c) On iNaturalist 2018, zero-shot CLIP encounters challenges in achieving high accuracy for fine-grained long-tail categories.
  • Figure 3: (a) Full fine-tuning improves head-class accuracy while decreasing tail-class accuracy, even if we optimize the balanced LA loss. (b-c) Inter-class feature similarities (heatmaps) and intra-class distributions from tail classes (histograms) on ImageNet-LT. Classifier fine-tuning limits head-class performance due to unoptimized inter-class similarities. Full fine-tuning optimizes inter-class similarities but leads to inconsistent distribution between train and test data on tail classes.
  • Figure 4: (a) Fine-tuning a small proportion of all parameters (e.g., 0.1%-2%) yields superior performance. As the proportion increases, performance deteriorates even when we search for the best learning rate. (b-c) Inter-class feature similarities (heatmaps) and intra-class distributions from tail classes (histograms) on ImageNet-LT. Both arbitrary and structured lightweight fine-tuning perform well in optimizing inter-class similarities and preserving intra-class distributions.
  • Figure 5: Convergence curves of mean-class and tail-class training accuracy.
  • ...and 10 more figures
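Figure 3 mentions optimizing the balanced LA loss. Assuming LA stands for logit adjustment (Menon et al.), a common balanced loss for long-tail learning, the logit-adjusted cross-entropy can be sketched in NumPy as below. The function name and the toy class counts are hypothetical. The loss used in the paper may differ in details such as the value of `tau`.

```python
import numpy as np

def logit_adjusted_ce(logits, labels, class_counts, tau=1.0):
    """Logit-adjusted softmax cross-entropy: add tau * log(class prior) to
    every logit before the loss, so tail classes must win by a larger raw
    margin during training. With tau=0 this is plain cross-entropy."""
    prior = np.array(class_counts, dtype=float)  # copy; never mutate caller data
    prior /= prior.sum()
    adjusted = logits + tau * np.log(prior)                    # broadcast over batch
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)  # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

counts = [900, 90, 10]          # hypothetical head / medium / tail counts
zero_logits = np.zeros((1, 3))  # the model is indifferent between classes
print(logit_adjusted_ce(zero_logits, np.array([0]), counts))  # head: -log 0.9
print(logit_adjusted_ce(zero_logits, np.array([2]), counts))  # tail: -log 0.01
```

With identical raw logits, a tail-class sample incurs a much larger loss than a head-class sample. This pushes the model to widen tail-class margins instead of defaulting to head classes.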

Theorems & Definitions (3)

  • Proposition 3.1
  • Proof
  • Remark 3.2