Aligning Visual Contrastive learning models via Preference Optimization

Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh, Mahyar JafariNodeh, Sepehr Kazemi, Simon Gottschalk

TL;DR

This work addresses the vulnerability of vision-language models to typographic attacks and bias by importing Preference Optimization (RLHF, DPO, IPO, KTO) into contrastive learning. It frames contrastive CLIP-style training as a one-step MDP and uses two data streams (preference data $\mathcal{D}_{\text{pref}}$ and regularization data $\mathcal{D}_{\text{reg}}$) to align behavior with human preferences while preserving pretrained knowledge via KL regularization. The methodology adapts DPO, IPO, and KTO to the non-generative setting, deriving gradient updates that push image embeddings toward the difference between preferred and dispreferred text representations, and introduces a linear transformation on embeddings (with SVD) to control concept directions. Experiments on typographic robustness and gender-bias disentanglement demonstrate improved adversarial robustness and fairness, with analysis showing better retention of pretrained knowledge compared to standard cross-entropy fine-tuning, and the ability to modulate task emphasis via the transformation scale $t$ and hyperparameters. The results suggest that PO can effectively align non-generative models to targeted preferences, offering practical benefits for fairness, robustness, and task-specific alignment in CLIP-like systems.
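The two-stream objective summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the policy is a softmax over text-image logits for a fixed set of candidate captions, uses the standard DPO loss for the preference term, and takes $\mathrm{KL}(\pi_\text{ref} \,\|\, \pi_\theta)$ as the regularizer. The KL direction, the weight `lam`, the temperature `tau`, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def log_policy(image_emb, text_embs, tau=1.0):
    """Assumed policy: log pi(y | x) as a softmax over text-image logits."""
    return log_softmax(text_embs @ image_emb / tau)

def dpo_loss(logp_theta, logp_ref, w, l, beta=1.0):
    """Standard DPO loss for one (preferred w, dispreferred l) pair."""
    h = (logp_theta[w] - logp_ref[w]) - (logp_theta[l] - logp_ref[l])
    return float(np.logaddexp(0.0, -beta * h))  # equals -log(sigmoid(beta * h))

def kl_ref_to_theta(logp_theta, logp_ref):
    """KL(pi_ref || pi_theta); the direction is an assumption of this sketch."""
    return float(np.sum(np.exp(logp_ref) * (logp_ref - logp_theta)))

def combined_loss(pref, reg, text_embs, w, l, beta=1.0, lam=1.0):
    """pref and reg are (trainable_image_emb, reference_image_emb) pairs,
    one drawn from the preference data and one from the regularization data."""
    l_pref = dpo_loss(log_policy(pref[0], text_embs),
                      log_policy(pref[1], text_embs), w, l, beta)
    l_reg = kl_ref_to_theta(log_policy(reg[0], text_embs),
                            log_policy(reg[1], text_embs))
    return l_pref + lam * l_reg

rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 4))   # frozen text embeddings, 3 candidate captions
img_ref = rng.normal(size=4)          # reference image embedding
w, l = 0, 1                           # indices of preferred / dispreferred captions

# Trainable encoder identical to the reference encoder.
loss_start = combined_loss((img_ref, img_ref), (img_ref, img_ref), text_embs, w, l)

# Move the preference-sample embedding toward T(y_w) - T(y_l).
img_moved = img_ref + 0.5 * (text_embs[w] - text_embs[l])
loss_moved = combined_loss((img_moved, img_ref), (img_ref, img_ref), text_embs, w, l)

print(loss_start, loss_moved)
```

When the trainable and reference encoders coincide, the preference term equals $\log 2$ and the KL term vanishes. Shifting the image embedding along the difference between the preferred and dispreferred text embeddings lowers the preference term, which matches the update direction described in the summary.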

Abstract

Contrastive learning models have demonstrated impressive abilities to capture semantic similarities by aligning representations in the embedding space. However, their performance can be limited by the quality of the training data and its inherent biases. While Preference Optimization (PO) methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to align generative models with human preferences, their use in contrastive learning has yet to be explored. This paper introduces a novel method for training contrastive learning models using different PO methods to break down complex concepts. Our method systematically aligns model behavior with desired preferences, enhancing performance on the targeted task. In particular, we focus on enhancing model robustness against typographic attacks and inductive biases, commonly seen in contrastive vision-language models like CLIP. Our experiments demonstrate that models trained using PO outperform standard contrastive learning techniques while retaining their ability to handle adversarial challenges and maintain accuracy on other downstream tasks. This makes our method well-suited for tasks requiring fairness, robustness, and alignment with specific preferences. We evaluate our method for tackling typographic attacks on images and explore its ability to disentangle gender concepts and mitigate gender bias, showcasing the versatility of our approach.

Paper Structure

This paper contains 39 sections, 4 theorems, 43 equations, 17 figures, 7 tables, 2 algorithms.

Key Result

Lemma 3.1

Under the assumption that the text encoder is frozen, i.e., $\mathcal{T}_\text{ref} = \mathcal{T}_\theta = \mathcal{T}$, the policy ratio for models using the contrastive learning policy, in methods such as DPO or IPO, can be expressed as: (equation not recovered)

See the appendix for the proof of Lemma 3.1.
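The frozen-text-encoder assumption has one consequence that can be checked numerically. The check below is not a reproduction of the lemma as stated in the paper; it only verifies an identity that follows if the policy is a softmax over the logits $\mathcal{T}(y)^T\mathcal{I}(x)/\tau$ for a fixed candidate set, which is an assumption of this sketch. Under that assumption the normalizing constants cancel, and the log-ratio difference used by DPO and IPO reduces to $(\mathcal{T}(y_w)-\mathcal{T}(y_l))^T(\mathcal{I}_\theta(x)-\mathcal{I}_\text{ref}(x))/\tau$. The temperature `tau` and the function names are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def policy_log_ratio(T, img_theta, img_ref, w, l, tau):
    """[log pi_theta(y_w|x) - log pi_ref(y_w|x)] - [same for y_l],
    computed from the full softmax policies."""
    logp_theta = log_softmax(T @ img_theta / tau)
    logp_ref = log_softmax(T @ img_ref / tau)
    return (logp_theta[w] - logp_ref[w]) - (logp_theta[l] - logp_ref[l])

def embedding_form(T, img_theta, img_ref, w, l, tau):
    """Same quantity written directly in terms of embeddings."""
    return (T[w] - T[l]) @ (img_theta - img_ref) / tau

rng = np.random.default_rng(0)
tau = 0.5
T = rng.normal(size=(5, 8))                       # shared, frozen text embeddings
img_ref = rng.normal(size=8)                      # reference image embedding
img_theta = img_ref + 0.1 * rng.normal(size=8)    # trainable image embedding

lhs = policy_log_ratio(T, img_theta, img_ref, w=0, l=1, tau=tau)
rhs = embedding_form(T, img_theta, img_ref, w=0, l=1, tau=tau)
print(np.isclose(lhs, rhs))
```

The two quantities agree because the log-partition terms of $\pi_\theta$ and $\pi_\text{ref}$ each appear once with a plus sign and once with a minus sign, so only the logit differences remain.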

Figures (17)

  • Figure 1: Overview of our proposed approach. On the left side, we calculate the preference optimization loss $\mathcal{L}_{\text{pref}}(\pi_\theta, \pi_\text{ref};\mathcal{D}_\text{pref})$ using the preference dataset, and the output logits $\mathcal{T}(y_w)^T\mathcal{I}(x), \mathcal{T}(y_l)^T\mathcal{I}(x)$ from both models. On the right side, the regulatory loss $\mathcal{L}_\text{reg}(\pi_\theta, \pi_\text{ref};\mathcal{D}_\text{reg})$ is calculated using the regularization dataset. The snowflake icons denote frozen encoders.
  • Figure 2: Comparisons of optical character recognition (OCR) and object detection (OD).
  • Figure 3: Analyses of the models' understanding of gender.
  • Figure 4: Images retrieved for the caption "an image of a police" with three different policies from top to bottom: reversed understanding of gender (6W, 4M), pretrained CLIP model (2W, 8M), neutralized understanding of gender (5W, 5M), i.e., $t=t^*$.
  • Figure 5: KL-divergence studies.
  • ...and 12 more figures

Theorems & Definitions (4)

  • Lemma 3.1
  • Corollary 3.2
  • Theorem 3.3
  • Proposition 3.4