Finetune like you pretrain: Improved finetuning of zero-shot vision models

Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, Aditi Raghunathan

TL;DR

This work shows that finetuning CLIP-like vision-language models with the same contrastive loss used in pretraining, guided by class-descriptive prompts, yields consistent ID and OOD improvements across a wide range of tasks. By updating both the image and language encoders and avoiding cross-entropy finetuning, FLYP achieves state-of-the-art results on benchmarks such as WILDS-iWILDCam, along with gains on ImageNet distribution shifts and in few-shot scenarios. Ablations demonstrate that the gains arise from closely matching the pretraining objective rather than from specific ensembling or prompt-template choices. The findings advocate for contrastive finetuning as a simple, robust, and broadly applicable baseline for downstream finetuning of image-text models.

Abstract

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
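
The contrastive finetuning objective described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The function names, the prompt template `"a photo of a {}."`, the temperature value `0.07`, and the choice to average the two loss directions are all assumptions made for illustration, and the embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def label_to_prompt(class_name, template="a photo of a {}."):
    # Cast a class label as a text prompt (the template is an illustrative assumption).
    return template.format(class_name)


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of matched pairs.

    image_emb[i] and text_emb[i] are assumed to be the embeddings of an image
    and of the prompt describing its class. Averaging the image-to-text and
    text-to-image terms is an assumption of this sketch.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and prompt j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each image should pick out its own prompt.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each prompt should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, a batch whose image and prompt embeddings are correctly paired receives a much lower loss than the same batch with the prompts shuffled, which is the signal that updates both encoders during finetuning.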

Paper Structure (38 sections, 3 equations, 5 figures, 7 tables, 1 algorithm)

Figures (5)

  • Figure 1: Finetune Like You Pretrain (FLYP): Given a downstream classification dataset, standard finetuning approaches revolve around the cross-entropy loss. In this work, we show that simply using the same loss as pretraining, i.e. the contrastive loss, with "task supervision" coming from text descriptions of the labels, consistently outperforms state-of-the-art approaches like LP-FT (Kumar et al., 2022) and WiseFT (Wortsman et al., 2021). For example, on ImageNet, our proposed approach outperforms LP-FT + weight ensembling by $1.1\%$ ID and $1.3\%$ OOD, with an ID-OOD frontier curve (orange curve) that dominates those of the baselines, i.e. lies above and to the right of all of them.
  • Figure 2: Our proposed approach FLYP outperforms the baselines both ID and OOD, with or without weight ensembling (Wortsman et al., 2021). Here we show the ID-OOD frontier curves obtained by linearly interpolating the finetuned model weights with the zero-shot weights. The curves for FLYP completely dominate (lie above and to the right of) those of the baselines on ImageNet, giving higher OOD accuracy for any ID accuracy. Comparing the weight-ensembled models with the best ID validation accuracy (stars), FLYP outperforms the current state of the art, LP-FT, by an average of $1.3\%$ OOD and $1.1\%$ ID, and outperforms WiseFT (weight-ensembled finetuning; Wortsman et al., 2021) by an average of $2\%$ OOD and $1.6\%$ ID. We report exact numbers in Table \ref{tab:big_distribution_shift_table}.
  • Figure 3: We evaluate FLYP on few-shot classification on ImageNet, where it outperforms all the baselines with weight ensembling, giving gains of $1.5\%$ in $4$-shot classification and $0.8\%$ in $16$-shot classification over LP-FT.
  • Figure 4: FLYP's performance is unaffected by the number of text templates used during finetuning. Here we compare using a single template versus 80 templates for the text descriptions on ImageNet. Observe that FLYP with a single template gives the same ID and OOD accuracy as FLYP with 80 templates, without ensembling. Note that the zero-shot model is also constructed using a single template, which causes a drop in its accuracy, similar to the observations in the CLIP paper.
  • Figure 5: Adding cross-entropy loss to FLYP's objective degrades the performance on ImageNet.
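
The weight ensembling referred to in the captions above (linearly interpolating the finetuned weights with the zero-shot weights) can be sketched as follows. This is a hedged illustration rather than the authors' code: parameters are represented as a dictionary mapping hypothetical parameter names to flat lists of floats, and the mixing coefficient name `alpha` is an assumption.

```python
def interpolate_weights(zeroshot, finetuned, alpha):
    """Return (1 - alpha) * zeroshot + alpha * finetuned, parameter by parameter.

    alpha = 0.0 recovers the zero-shot model and alpha = 1.0 the finetuned one.
    Both arguments map parameter names to equally sized flat lists of floats
    (an illustrative stand-in for real model weights).
    """
    if zeroshot.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    mixed = {}
    for name in zeroshot:
        z, f = zeroshot[name], finetuned[name]
        if len(z) != len(f):
            raise ValueError(f"parameter {name!r} has mismatched sizes")
        mixed[name] = [(1.0 - alpha) * zi + alpha * fi for zi, fi in zip(z, f)]
    return mixed


def frontier_models(zeroshot, finetuned, num_points=11):
    # Sweep alpha over [0, 1]; evaluating each interpolated model ID and OOD
    # would trace out a frontier curve like those described in the captions.
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(a, interpolate_weights(zeroshot, finetuned, a)) for a in alphas]
```

Under this sketch, each point on a frontier curve corresponds to one value of `alpha`, and the starred points in Figure 2 would correspond to the value selected by ID validation accuracy.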