Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models

Beier Zhu, Kaihua Tang, Qianru Sun, Hanwang Zhang

TL;DR

This paper identifies label bias embedded in foundation models due to web-scale pre-training and proposes Generalized Logit Adjustment (GLA), a post-hoc debiasing and ensembling method that combines a debiased zero-shot model with a fine-tuned model. GLA estimates the pre-training label prior from downstream data and enforces an equal-weighted ensemble after subtracting both pre-training and downstream priors, yielding a Bayes-optimal classifier for balanced target distributions. Theoretical analysis shows GLA minimizes target misclassification risk and outperforms both individual models and naive ensembles under typical conditions. Empirically, GLA delivers consistent gains across many-shot, few-shot, and long-tail tasks, including a 1.5 percentage point improvement on ImageNet and substantial gains on 11 few-shot datasets, demonstrating practical impact for robust foundation-model deployment.
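
The combination step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads "subtracting both pre-training and downstream priors" as subtracting the log of each prior from the summed logits. The function names `gla_logits` and `predict` and all toy numbers are hypothetical.

```python
import math

def gla_logits(zs_logits, ft_logits, pretrain_prior, downstream_prior):
    """Sum zero-shot and fine-tuned logits, then remove both log-priors.

    All arguments are lists of length K (number of classes). The priors are
    probability vectors. Assumption: pretrain_prior has been estimated
    beforehand, since the pre-training label distribution is not observable.
    """
    return [
        zs + ft - math.log(pp) - math.log(ps)
        for zs, ft, pp, ps in zip(zs_logits, ft_logits, pretrain_prior, downstream_prior)
    ]

def predict(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

# Toy example (hypothetical numbers): class 0 dominates the pre-training prior.
zs = [2.0, 1.8, 0.5]                    # zero-shot logits
ft = [1.0, 1.1, 0.2]                    # fine-tuned logits
pretrain_prior = [0.7, 0.2, 0.1]        # assumed estimate of the pre-training prior
downstream_prior = [1 / 3, 1 / 3, 1 / 3]  # balanced downstream training set

naive = [a + b for a, b in zip(zs, ft)]  # plain logit ensemble, no debiasing
adjusted = gla_logits(zs, ft, pretrain_prior, downstream_prior)

print(predict(naive), predict(adjusted))
```

In this toy case the plain ensemble picks the frequent class 0, while the adjusted logits pick class 1, because the head class loses the advantage it drew from the skewed prior.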

Abstract

Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data. Yet, the zero-shot performance is less competitive than a fully supervised one. Thus, to enhance the performance, fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks. However, we argue that such prior work has overlooked the inherent biases in foundation models. Due to the highly imbalanced Web-scale training set, these foundation models are inevitably skewed toward frequent semantics, and thus the subsequent fine-tuning or ensembling is still biased. In this study, we systematically examine the biases in foundation models and demonstrate the efficacy of our proposed Generalized Logit Adjustment (GLA) method. Note that bias estimation in foundation models is challenging, as most pre-training data cannot be explicitly accessed, unlike in traditional long-tailed classification tasks. To this end, GLA has an optimization-based bias estimation approach for debiasing foundation models. As our work resolves a fundamental flaw in the pre-training, the proposed GLA demonstrates significant improvements across a diverse range of tasks: it achieves 1.5 pp accuracy gains on ImageNet, a large average improvement (1.4-4.6 pp) on 11 few-shot datasets, and 2.4 pp gains on long-tailed classification. Code is available at https://github.com/BeierZhu/GLA.
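
The abstract says only that the bias estimation is optimization-based. One plausible form of such an estimator is sketched below. The objective is an assumption for illustration, not the paper's stated procedure: it fits a per-class offset `b` (standing in for log q) by gradient descent on the cross-entropy of the offset zero-shot logits over labeled downstream data. The function `estimate_label_prior`, its hyperparameters, and the toy data are all hypothetical.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_label_prior(zs_logits, labels, steps=200, lr=0.5):
    """Estimate a label prior q by minimizing cross-entropy of (logits - b).

    Assumption: b stands in for log q up to an additive constant, so the
    returned prior is softmax(b). This is a sketch, not the authors' solver.
    """
    num_classes = len(zs_logits[0])
    b = [0.0] * num_classes
    n = len(labels)
    for _ in range(steps):
        grad = [0.0] * num_classes
        for z, y in zip(zs_logits, labels):
            p = softmax([zk - bk for zk, bk in zip(z, b)])
            for k in range(num_classes):
                # d(cross-entropy)/d(b_k) = 1[y == k] - p_k
                grad[k] += ((1.0 if k == y else 0.0) - p[k]) / n
        b = [bk - lr * gk for bk, gk in zip(b, grad)]
    return softmax(b)

# Toy data (hypothetical): every zero-shot logit for class 0 is inflated by 1.5.
zs_logits = [[2.5, 0.0], [1.5, 1.0]] * 4
labels = [0, 1] * 4

q = estimate_label_prior(zs_logits, labels)
print(q)
```

On this toy data the recovered ratio q[0] / q[1] approaches exp(1.5), matching the offset that was injected into class 0.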

Paper Structure

This paper contains 27 sections, 5 theorems, 20 equations, 7 figures, 11 tables.

Key Result

Lemma 1

The Bayes optimal classifier $y^*$ for $P$ has lower risk than all classifiers $\hat{y}: \mathcal{X} \rightarrow \mathcal{Y}$.

Figures (7)

  • Figure 1: (a) Per-class accuracy of CLIP-ViT/B16 on ImageNet. Class indices are sorted using the estimated pre-training label prior. Curves are smoothed for better visualization. (b) Breakdown performance of different models on ImageNet. We equally divide the ImageNet classes into three subgroups, according to the class index. Existing ensemble methods like WiSE-FT (Wortsman et al., 2022) exhibit a clear performance loss on tail classes, while our GLA stands out for all three subgroups.
  • Figure 2: Illustration of debiasing process on ImageNet validation set. (a) The original distribution of zero-shot outputs; (b) the estimated pre-train distribution $\mathbf{q}$ based on our algorithm; (c) the distribution of debiased zero-shot outputs using estimated $\mathbf{q}$.
  • Figure 3: Estimating label bias of CIFAR-10-LT-IB-10.
  • Figure 4: Accuracy with mixing coefficient $\alpha$.
  • Figure 4: Evaluation on robustness to distribution shift at 16 training shots.
  • ...and 2 more figures

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 1
  • Lemma 2
  • Proposition 1
  • proof
  • Corollary 1
  • Proposition 2
  • proof
  • ...and 1 more