Self-Supervised Visual Preference Alignment

Ke Zhu, Zheng Ge, Liang Zhao, Xiangyu Zhang

TL;DR

The paper addresses unsupervised preference alignment for vision-language models by introducing SeVa, which constructs preference data from original versus augmented images and trains via Direct Preference Optimization without GPT-4 or human labeling. It connects to contrastive learning and demonstrates that medium-strength, diffusion-based augmentations yield hard negatives that enhance multi-modal reasoning across benchmarks such as MMVet, MMBench, and POPE. SeVa achieves competitive performance relative to GPT-4 on select tasks and exhibits reduced hallucinations and improved alignment with user intentions, while remaining data-efficient and scalable. Overall, the work provides a practical, scalable pathway to align VLMs with user goals using self-generated preference data.

Abstract

This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses from the original and augmented versions of each image, respectively, and conduct preference alignment with direct preference optimization. It is based on a core idea: a properly designed augmentation of the image input induces the VLM to generate false but hard negative responses, which the model can learn from to produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient, requiring only a few lines of code. With only 8k randomly sampled unsupervised data, it achieves a 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% on the complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations reveals the underlying mechanism of the approach and indicates its potential for further scaling. Code is available at https://github.com/Kevinz-code/SeVa.
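
The idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_pairs`, `dpo_loss`, the `answer` and `augment` callables, and the rule of dropping pairs with identical responses are all hypothetical names and choices made for this example. The loss follows the standard DPO formulation, with the sequence log-probabilities assumed to be computed elsewhere.

```python
import math
from typing import Callable, List, NamedTuple, Tuple


class PreferencePair(NamedTuple):
    question: str
    chosen: str    # response produced from the original image
    rejected: str  # response produced from the augmented image


def build_pairs(
    samples: List[Tuple[object, str]],
    answer: Callable[[object, str], str],
    augment: Callable[[object], object],
) -> List[PreferencePair]:
    """Build preference pairs from (image, question) samples.

    `answer` stands in for a VLM forward pass and `augment` for the image
    augmentation; both are placeholders supplied by the caller.
    """
    pairs = []
    for image, question in samples:
        chosen = answer(image, question)
        rejected = answer(augment(image), question)
        # Assumed filtering rule: identical responses carry no preference signal.
        if chosen != rejected:
            pairs.append(PreferencePair(question, chosen, rejected))
    return pairs


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss decreases as the policy assigns relatively more probability to the chosen response than the frozen reference model does, which is what pushes the model away from the responses induced by the augmented image.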

Paper Structure (18 sections, 15 equations, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: Illustration of the baseline LLaVA-13B (v1.5) and the proposed SeVa-13B. We show three variants of SeVa, each using a different sampling seed to obtain the unlabeled dataset (the image-text pairs used for DPO sample generation, cf. Alg. 1).
  • Figure 2: Test-time image augmentations (TTA) plugged into LLaVA-1.5 on three benchmarks. We include standard augmentations: RandFlip, RandomResizedCrop ('RRCrop'), RandomCrop, CenterCrop, RandomAffine, RandomInvert and AutoAug; diffusion noise augmentations: Diffusion-Weak ('W') and Diffusion-Strong ('S'); and mixtures: the strategies adopted in MoCo, BYOL and SimCLR.
  • Figure 3: The pipeline of SeVa. For each image $I$ in the selected dataset, we apply a data augmentation $T$ to obtain a distorted version, while keeping a copy of the original image to form a pair. The same questions are applied to both images of the pair to obtain the chosen and rejected responses, respectively, which undergo a data collection (e.g., filtering) process before DPO training. In the left part, incorrect words or sentences are shown in red, while in the right part (the improved version of the model), we highlight excellent content in bold face. Note that the figure shows the same image for both training and testing, but in practice their data distributions differ (cf. the experimental settings section). This figure is best viewed in color.
  • Figure 4: Illustration of representative questions in five different datasets from LLaVA665k (LLaVA-1.5). In our main experiment, we adopt a combination of 'textvqa' and 'ocrvqa'. The results of applying the other three question types in SeVa can be found in the dataset ablation table.
  • Figure 5: Pair-wise competition and output sentence length (in tokens) on LLaVA$^\text{W}$ and MMVet, respectively. We compare SeVa and LLaVA-1.5 models in the 7B and 13B settings.
  • ...and 6 more figures
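
The diffusion noise augmentation named in the Figure 2 caption can be illustrated with the standard forward diffusion step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below is an assumption-laden example rather than the paper's code: the linear beta schedule, its endpoints, the function names `alpha_bar` and `diffuse`, and the step values in the usage lines are hypothetical, and pixels are assumed to be scaled to [-1, 1].

```python
import math
import random
from typing import List


def alpha_bar(step: int, num_steps: int = 1000,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_t) under an assumed linear beta schedule."""
    prod = 1.0
    for t in range(step):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def diffuse(pixels: List[float], step: int, rng: random.Random,
            num_steps: int = 1000) -> List[float]:
    """Apply `step` steps of forward diffusion noise to a flat list of pixels."""
    a = alpha_bar(step, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in pixels]


# Illustrative usage: a larger step keeps less of the original image.
rng = random.Random(0)
image = [0.5, -0.25, 0.0, 1.0]
weak = diffuse(image, step=200, rng=rng)    # hypothetical "weak" setting
strong = diffuse(image, step=800, rng=rng)  # hypothetical "strong" setting
```

A single step parameter controls augmentation strength here, which makes it easy to sweep between settings that barely change the response and settings that destroy the image content.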