Aligning Visual Contrastive learning models via Preference Optimization

Amirabbas Afzali; Borna Khodabandeh; Ali Rasekh; Mahyar JafariNodeh; Sepehr Kazemi; Simon Gottschalk

TL;DR

This work addresses the vulnerability of vision-language models to typographic attacks and bias by importing Preference Optimization (RLHF, DPO, IPO, KTO) into contrastive learning. It frames contrastive CLIP-style training as a one-step MDP and uses two data streams (preference data $\text{D}_{\text{pref}}$ and regularization data $\text{D}_{\text{reg}}$) to align behavior with human preferences while preserving pretrained knowledge via KL regularization. The methodology adapts DPO, IPO, and KTO to the non-generative setting, deriving gradient updates that push image embeddings toward the difference between preferred and dispreferred text representations, and introduces a linear transformation on embeddings (with SVD) to control concept directions. Experiments on typographic robustness and gender-bias disentanglement demonstrate improved adversarial robustness and fairness, with analysis showing better retention of pretrained knowledge compared to standard cross-entropy fine-tuning, and the ability to modulate task emphasis via the transformation scale $t$ and hyperparameters. The results suggest that PO can effectively align non-generative models to targeted preferences, offering practical benefits for fairness, robustness, and task-specific alignment in CLIP-like systems.
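
To make the DPO adaptation summarized above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the policy $\pi(y \mid x)$ is a softmax over temperature-scaled cosine similarities between one image embedding and a set of candidate text embeddings, and it applies the standard DPO loss to a single preference pair. The function names, the toy embeddings, and the values of `temperature` and `beta` are all hypothetical and chosen only for illustration.

```python
import math


def log_policy(image_emb, text_embs, temperature=1.0):
    """Assumed policy: log-softmax over cosine similarities (image vs. each text)."""
    def normalize(v):
        n = math.sqrt(sum(a * a for a in v))
        return [a / n for a in v]

    x = normalize(image_emb)
    logits = [
        sum(a * b for a, b in zip(x, normalize(t))) / temperature
        for t in text_embs
    ]
    # Numerically stable log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def dpo_loss(policy_logp, ref_logp, preferred, dispreferred, beta=1.0):
    """Standard DPO loss for one (preferred, dispreferred) pair of text indices."""
    margin = beta * (
        (policy_logp[preferred] - ref_logp[preferred])
        - (policy_logp[dispreferred] - ref_logp[dispreferred])
    )
    # -log(sigmoid(margin)) computed as a stable softplus(-margin).
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy setup (hypothetical): text 0 is the true label, text 1 is the label
# written on the image by a typographic attack.
texts = [[1.0, 0.0], [0.0, 1.0]]
ref_image = [0.6, 0.8]      # frozen reference model leans toward the attack text
policy_image = [0.8, 0.6]   # fine-tuned model leans toward the true label

ref_lp = log_policy(ref_image, texts)
pol_lp = log_policy(policy_image, texts)

loss_unchanged = dpo_loss(ref_lp, ref_lp, preferred=0, dispreferred=1)
loss_aligned = dpo_loss(pol_lp, ref_lp, preferred=0, dispreferred=1)
```

In this sketch the loss equals $\log 2$ when the policy matches the reference model and decreases as the policy assigns relatively more probability to the preferred text than the reference does. The KL-style regularization on $\text{D}_{\text{reg}}$ and the IPO and KTO variants are not shown.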

Abstract

Contrastive learning models have demonstrated impressive abilities to capture semantic similarities by aligning representations in the embedding space. However, their performance can be limited by the quality of the training data and its inherent biases. While Preference Optimization (PO) methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to align generative models with human preferences, their use in contrastive learning has yet to be explored. This paper introduces a novel method for training contrastive learning models using different PO methods to break down complex concepts. Our method systematically aligns model behavior with desired preferences, enhancing performance on the targeted task. In particular, we focus on enhancing model robustness against typographic attacks and inductive biases, commonly seen in contrastive vision-language models like CLIP. Our experiments demonstrate that models trained using PO outperform standard contrastive learning techniques while retaining their ability to handle adversarial challenges and maintain accuracy on other downstream tasks. This makes our method well-suited for tasks requiring fairness, robustness, and alignment with specific preferences. We evaluate our method for tackling typographic attacks on images and explore its ability to disentangle gender concepts and mitigate gender bias, showcasing the versatility of our approach.
