SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning

Bac Nguyen, Stefan Uhlich, Fabien Cardinaux, Lukas Mauch, Marzieh Edraki, Aaron Courville

TL;DR

SAFT addresses the problem that fine-tuning large vision-language models such as CLIP can hurt out-of-distribution (OOD) generalization. It selects a small, task-relevant subset of parameters to update (about 0.1% of the model), chosen by the magnitude of their gradients under the downstream loss, and freezes the rest to preserve pre-trained knowledge. The approach yields strong OOD gains across ImageNet distribution shifts, base-to-new class generalization, and cross-dataset transfer, while remaining architecture-agnostic and applicable to NLP tasks as well. A theoretical generalization bound supports the idea that reducing the number of trainable parameters can improve in-domain generalization, and empirical results show that SAFT is effective in practice with minimal parameter updates.

Abstract

Handling distribution shifts from training data, known as out-of-distribution (OOD) generalization, poses a significant challenge in the field of machine learning. While a pre-trained vision-language model like CLIP has demonstrated remarkable zero-shot performance, further adaptation of the model to downstream tasks leads to undesirable degradation for OOD data. In this work, we introduce Sparse Adaptation for Fine-Tuning (SAFT), a method that prevents fine-tuning from forgetting the general knowledge in the pre-trained model. SAFT only updates a small subset of important parameters whose gradient magnitude is large, while keeping the other parameters frozen. SAFT is straightforward to implement and conceptually simple. Extensive experiments show that with only 0.1% of the model parameters, SAFT can significantly improve the performance of CLIP. It consistently outperforms baseline methods across several benchmarks. On the few-shot learning benchmark of ImageNet and its variants, SAFT gives a gain of 5.15% on average over the conventional fine-tuning method in OOD settings.
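
The two steps described above (pick the parameters with the largest gradient magnitude, then update only those) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_mask`, `mse_grad`, and `sparse_fine_tune` are hypothetical, a linear model with squared loss stands in for CLIP, the mask is computed once from a full-batch gradient at the pre-trained weights, and plain gradient descent replaces whatever optimizer the paper uses.

```python
import numpy as np


def select_mask(grad, sparsity=0.001):
    """Return a boolean mask marking the top `sparsity` fraction of entries by |grad|.

    Assumption: selection is global over all entries and at least one is kept.
    """
    k = max(1, int(round(sparsity * grad.size)))
    flat = np.abs(grad).ravel()
    top = np.argpartition(flat, -k)[-k:]  # indices of the k largest magnitudes
    mask = np.zeros(grad.size, dtype=bool)
    mask[top] = True
    return mask.reshape(grad.shape)


def mse_grad(theta, X, y):
    """Gradient of the mean squared error of a linear model (toy downstream loss)."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)


def sparse_fine_tune(theta0, X, y, sparsity=0.1, lr=0.05, steps=200):
    """Phase I: choose trainable entries once; Phase II: update only those entries."""
    mask = select_mask(mse_grad(theta0, X, y), sparsity)
    theta = theta0.copy()
    for _ in range(steps):
        # Multiplying by the mask zeroes the update for frozen entries.
        theta -= lr * mask * mse_grad(theta, X, y)
    return theta, mask
```

Calling `sparse_fine_tune(theta0, X, y, sparsity=0.1)` on a 20-parameter model trains two entries and leaves the other 18 at their initial values, which is the mechanism the abstract credits with preserving pre-trained knowledge.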

Paper Structure

This paper contains 21 sections, 1 theorem, 8 equations, 7 figures, 14 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{G} = \{ f_{\tilde{\theta}} \mid \tilde{\theta} \in \tilde{\Theta} \}$ be a set of classifiers $f_{\tilde{\theta}}$, where $\tilde{\theta}$ consists of $d$ parameters each of which can have at most $r$ discrete values. Given a dataset of $N$ examples, there exists an $\tilde{\theta} \in

Figures (7)

  • Figure 1: Results for few-shot learning. We report the average accuracy on four distribution-shift variants of ImageNet [deng2009imagenet], which are ImageNet-V2 [recht2019imagenet], ImageNet-Sketch [wang2019learning], ImageNet-A [hendrycks2021natural], and ImageNet-R [hendrycks2021many].
  • Figure 2: An overview of Sparse Adaptation for Fine-Tuning (SAFT). Our method consists of two phases: (I) We use the downstream dataset to select learnable parameters; (II) We fine-tune the model on the downstream dataset.
  • Figure 3: Top-5 retrieved images for a given prompt. Images are arranged from left to right in descending order of similarity to the given prompt. A green box indicates a correct match between image and text, while a red box indicates an incorrect match.
  • Figure 4: Performance difference in base-to-new generalization settings. We report the difference between SAFT and FT: (a) in new classes and (b) in base classes.
  • Figure 5: ID vs OOD performance with different sparsity levels.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Definition 1: $(\gamma, S)$-compressible using helper string $s$