CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Shuyang Gu, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

TL;DR

Problem: CLIP's fine-tuning performance has been inconsistent and often undervalued compared to supervised pre-training. Approach: perform a comprehensive hyper-parameter study and apply off-the-shelf fine-tuning techniques to CLIP on ImageNet-1K. Contributions: establish a strong, reproducible CLIP fine-tuning baseline; show that small learning rates, EMA, layer-wise LR decay, shorter training, and cautious augmentations dramatically improve accuracy; achieve 85.7% Top-1 for ViT-B/16 and 88.0% Top-1 for ViT-L/14 at 224×224, with further gains at higher resolutions; compare favorably to supervised pre-training and MIM-target methods. Impact: provides a robust baseline, encourages rethinking CLIP-based improvements, and demonstrates CLIP's strong representation capabilities even with noisy web data.

Abstract

Recent studies have shown that CLIP has achieved remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and the latest works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at https://github.com/LightDXY/FT-CLIP.
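
Two of the techniques named in the TL;DR, layer-wise LR decay and EMA, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the grouping of layers (embeddings, then transformer blocks, then head), and the numeric values are all assumptions chosen for the example.

```python
from __future__ import annotations


def layerwise_lr_scales(num_layers: int, decay: float) -> list[float]:
    """Return one learning-rate multiplier per layer group.

    Assumed grouping (hypothetical, for illustration):
      index 0                -> patch/position embeddings
      indices 1..num_layers  -> transformer blocks, bottom to top
      index num_layers + 1   -> classification head
    The head keeps multiplier 1.0; each group below it is scaled by
    `decay` one more time, so lower layers move more slowly.
    """
    top = num_layers + 1
    return [decay ** (top - i) for i in range(top + 1)]


def ema_update(ema: list[float], params: list[float], momentum: float) -> list[float]:
    """One exponential-moving-average step over a flat list of weights."""
    return [momentum * e + (1.0 - momentum) * p for e, p in zip(ema, params)]


# Illustrative values only; they are not taken from the paper.
base_lr = 3e-5
scales = layerwise_lr_scales(num_layers=12, decay=0.6)
per_group_lr = [base_lr * s for s in scales]

# The EMA copy is updated after every optimizer step and used for evaluation.
ema_weights = ema_update(ema=[0.0, 1.0], params=[1.0, 1.0], momentum=0.9)
```

In a real training loop, each multiplier would be attached to the corresponding parameter group of the optimizer, and the EMA weights would be kept alongside the live weights.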

Paper Structure (5 sections, 3 figures, 17 tables)

This paper contains 5 sections, 3 figures, 17 tables.

Figures (3)

  • Figure 1: Overview. We show the components changed to improve CLIP fine-tuning performance. With a proper fine-tuning strategy, the CLIP model achieves fine-tuning performance comparable to that of a model pre-trained with supervision on JFT. The "fine-tuning cost" denotes the GPU hours calculated with a single V100.
  • Figure 2: Training Length. Each plot shows the epoch-accuracy curve during training. Top: in the 100-epoch fine-tuning setting, the model reaches its best result after half of the training epochs and overfits the training set in the remaining epochs. Bottom: in the 50-epoch fine-tuning setting, the model reaches a similar best accuracy and is under-fitting.
  • Figure 3: Partial fine-tuning results of CLIP-Base/16. Tuning 0 layers corresponds to linear probing, and tuning 12 layers corresponds to full fine-tuning. The features learned by CLIP are so strong that freezing half of the layers still yields 85.6% top-1 accuracy, close to the full fine-tuning result.
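
The partial fine-tuning setup described in Figure 3 can be illustrated with a small helper that decides which parts of a 12-block ViT are trainable. This is a sketch under stated assumptions: the caption does not say which layers are frozen, so the code assumes the bottom blocks are frozen and the top `tuned_layers` blocks are trained, and that the embeddings are trained only in the full fine-tuning case. The function name and the group names (`embed`, `blocks.i`, `head`) are hypothetical.

```python
from __future__ import annotations


def trainable_flags(num_layers: int, tuned_layers: int) -> dict[str, bool]:
    """Map each layer group to True (trained) or False (frozen).

    tuned_layers == 0          -> linear probing: only the head is trained.
    tuned_layers == num_layers -> full fine-tuning: everything is trained.
    Otherwise the top `tuned_layers` blocks and the head are trained
    (assumption: freezing starts from the bottom of the network).
    """
    if not 0 <= tuned_layers <= num_layers:
        raise ValueError("tuned_layers must be between 0 and num_layers")
    first_tuned = num_layers - tuned_layers
    # Assumption: embeddings are only updated during full fine-tuning.
    flags = {"embed": tuned_layers == num_layers}
    for i in range(num_layers):
        flags[f"blocks.{i}"] = i >= first_tuned
    flags["head"] = True  # the classification head is always trained
    return flags


linear_probe = trainable_flags(num_layers=12, tuned_layers=0)
half_frozen = trainable_flags(num_layers=12, tuned_layers=6)
full_finetune = trainable_flags(num_layers=12, tuned_layers=12)
```

With a deep-learning framework, the flags would typically be applied by disabling gradients on the parameters of every group marked `False`.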