Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Taiga Yamane, Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura

TL;DR

DiVE addresses a problem in robust fine-tuning of vision-language models: existing methods often distort the pre-trained embedding geometry, which hurts OOD and zero-shot performance. It introduces Difference Vector Equalization, which uses two losses, Average Vector Loss (AVL) and Pairwise Vector Loss (PVL), to constrain the difference vectors between pre-trained and fine-tuned embeddings to be equal across samples, globally (AVL) and locally (PVL). Empirically, DiVE preserves geometric structure (as shown by RSA) and delivers strong ID, OOD, and zero-shot results across multiple datasets and architectures, outperforming prior robust fine-tuning methods while maintaining competitive ID accuracy. The method suggests a direction for robust fine-tuning that focuses on preserving pre-training geometry.

Abstract

Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
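
The core idea, that equal difference vectors preserve geometry, can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the difference is taken as fine-tuned minus pre-trained, the weighting defaults to uniform (the abstract only says "weighted average"), and the embeddings are random stand-ins rather than CLIP outputs. The penalty is written in the spirit of AVL; PVL is not sketched because the abstract does not specify its form.

```python
import numpy as np

def difference_vectors(pre_emb, ft_emb):
    """Per-sample difference vectors (assumed direction: fine-tuned minus pre-trained)."""
    return ft_emb - pre_emb

def equalization_penalty(diffs, weights=None):
    """Mean squared distance of each difference vector to their weighted average.

    Hypothetical AVL-style penalty. Uniform weights are an assumption; the
    paper's actual weighting scheme is not given in the abstract.
    """
    n = diffs.shape[0]
    if weights is None:
        weights = np.full(n, 1.0 / n)
    mean_diff = weights @ diffs  # weighted average, shape (dim,)
    return float(np.mean(np.sum((diffs - mean_diff) ** 2, axis=1)))

def pairwise_distances(emb):
    """Euclidean distance between every pair of embeddings, shape (n, n)."""
    delta = emb[:, None, :] - emb[None, :, :]
    return np.linalg.norm(delta, axis=-1)

rng = np.random.default_rng(0)
pre = rng.normal(size=(5, 4))                    # stand-in pre-trained embeddings
ft_equal = pre + rng.normal(size=4)              # every sample moved by the same vector
ft_distorted = pre + rng.normal(size=(5, 4))     # every sample moved differently

penalty_equal = equalization_penalty(difference_vectors(pre, ft_equal))
penalty_distorted = equalization_penalty(difference_vectors(pre, ft_distorted))

# Equal difference vectors amount to a translation, so pairwise distances are unchanged.
geometry_preserved = np.allclose(pairwise_distances(pre), pairwise_distances(ft_equal))
geometry_distorted = not np.allclose(pairwise_distances(pre), pairwise_distances(ft_distorted))
```

When all difference vectors are equal, the penalty is zero up to floating-point error and the pairwise distance matrix of the fine-tuned embeddings matches that of the pre-trained ones. Sample-specific shifts give a positive penalty and change the distances.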

Paper Structure

This paper contains 21 sections, 13 equations, 2 figures, and 14 tables.

Figures (2)

  • Figure 1: (a) Illustrative example of generalization degradation. The performance of pre-trained models in out-of-distribution (OOD) and zero-shot (ZS) settings degrades severely after vanilla fine-tuning (FT) on in-distribution (ID) data. (b) Normalized performance of robust fine-tuning methods on ID, OOD, and ZS metrics, with ImageNet as the target task. DiVE performs well across all metrics.
  • Figure 2: Overview of the proposed method, Difference Vector Equalization (DiVE). It uses a contrastive loss for fine-tuning. While fine-tuning on target data, it constrains all difference vectors for reference data ($\bm{x}^{\rm ref}$ and $\bm{t}^{\rm ref}$) to be equal.