Robust Fine-tuning of Zero-shot Models via Variance Reduction

Beier Zhu, Jiequan Cui, Hanwang Zhang

TL;DR

This work proposes a sample-wise ensembling technique that simultaneously attains the best ID and OOD accuracy without the usual trade-off. The method is termed Variance Reduction Fine-tuning (VRF) because it reduces the variance in ensemble predictions, thereby decreasing residual error.

Abstract

When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD) settings. Recently, ensemble-based models (ESM) have been shown to offer significant robustness improvements while preserving high ID accuracy. However, our study finds that ESMs do not solve the ID-OOD trade-off: they achieve peak ID accuracy and peak OOD accuracy at different mixing coefficients. When optimized for OOD accuracy, the ensemble model exhibits a noticeable decline in ID accuracy, and vice versa. In contrast, we propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-off. Specifically, we construct a Zero-Shot Failure (ZSF) set containing training samples incorrectly predicted by the zero-shot model. For each test sample, we calculate its distance to the ZSF set and assign a higher weight to the fine-tuned model in the ensemble if the distance is small. We term our method Variance Reduction Fine-tuning (VRF), as it effectively reduces the variance in ensemble predictions, thereby decreasing residual error. On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5 - 2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similarly large robustness gains (0.9 - 3.1 pp) on other distribution-shift benchmarks. Code is available at https://github.com/BeierZhu/VRF.
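
The sample-wise weighting described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the feature space, the distance metric, or the exact weight function, so the Euclidean $k$-NN distance, the sigmoid-shaped `finetuned_weight` and its parameters `a` and `b`, and the choice to mix softmax probabilities are all hypothetical. All function names are illustrative.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def build_zsf_set(train_feats, train_labels, zs_train_logits):
    """Keep the training features that the zero-shot model misclassifies."""
    wrong = zs_train_logits.argmax(axis=1) != train_labels
    return train_feats[wrong]


def knn_distance(test_feats, zsf_feats, k=1):
    """Distance from each test feature to its k-th nearest ZSF feature.

    Assumption: Euclidean distance in some shared feature space.
    """
    if len(zsf_feats) == 0:
        # No zero-shot failures: treat every test sample as infinitely far.
        return np.full(len(test_feats), np.inf)
    diffs = test_feats[:, None, :] - zsf_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    k = min(k, dists.shape[1])
    return np.sort(dists, axis=1)[:, k - 1]


def finetuned_weight(d, a=1.0, b=0.5):
    """Hypothetical decreasing map from distance to a weight in [0, 1].

    Small distance to the ZSF set -> weight near 1 (trust the fine-tuned model).
    """
    return 1.0 / (1.0 + np.exp((d - a) / b))


def samplewise_ensemble(zs_logits, ft_logits, w):
    """Per-sample mix: w(x) * p_ft(x) + (1 - w(x)) * p_zs(x)."""
    w = w[:, None]
    return w * softmax(ft_logits) + (1.0 - w) * softmax(zs_logits)
```

A fixed-coefficient ensemble corresponds to passing the same weight for every test sample; the sample-wise variant lets the weight depend on how close each sample lies to the ZSF set.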

Paper Structure

This paper contains 23 sections, 12 equations, 13 figures, 13 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) ID-OOD frontier curves for the CLIP ViT-B/16 model on the ID (ImageNet) and OOD (IN-{V2, R, A, Sketch} and ObjectNet) datasets by varying the mixing coefficient $\alpha$. The ensemble model achieves its best ID and OOD performance at different $\alpha$ values. Our method VRF simultaneously attains the best ID and OOD accuracy, outperforming the ensemble by $3.6\%$ on OOD and $1.6\%$ on ID at its optimal performance points. (b) Relationship between the ratio of fine-tuned accuracy to zero-shot accuracy ($\frac{\text{Acc}_\mathsf{ft}}{\text{Acc}_\mathsf{zs}}$) and the distance to the zero-shot failure set ($d(\mathbf{x})$). $\frac{\text{Acc}_\mathsf{ft}}{\text{Acc}_\mathsf{zs}}$ demonstrates a monotonic decrease as $d(\mathbf{x})$ increases.
  • Figure 2: Relationship between $\frac{\text{Acc}_\mathsf{ft}}{\text{Acc}_\mathsf{zs}}$ and the weight $\omega(\mathbf{x})$.
  • Figure 3: ID-OOD frontier curves by varying the mixing coefficient $\alpha$ and $\frac{\text{Acc}_\mathsf{ft}}{\text{Acc}_\mathsf{zs}}$ curves for the CLIP ViT-B/16. (a) CIFAR-10 (ID) and STL-10 (OOD) results. (b) Entity-30 results.
  • Figure 4: ZSF set $\mathcal{V}$ vs. all data $\mathcal{D}$
  • Figure 5: (a) Averaged weight $\mathbb{E}_{\mathbf{x}}[\omega(\mathbf{x})]$ on different datasets. (b) VRF based on logit-space ensembling. (c) Comparison with the effect of different $k$ in the $k$-NN distance.
  • ...and 8 more figures