VaPR -- Vision-language Preference alignment for Reasoning

Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng

TL;DR

VaPR introduces a hard-negative preference data generation framework for vision-language models that addresses stylistic and length biases in synthetic annotations and improves reasoning. By editing ground-truth responses with task-aware perturbations while preserving style and length, VaPR builds a 30K-sample dataset that, when used with Direct Preference Optimization, yields consistent gains across three LVLM families and ten benchmarks. The approach reduces the bias toward answering "Yes" on binary questions, and performance improves as the dataset grows. The framework also generalizes to open-source LLM editors: models trained on VaPR-OS reach near parity with models trained on the GPT-4o-edited data. The work additionally compares VaPR with prior preference datasets.

Abstract

Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing that produces rejected responses with targeted errors while maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL, and Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer "Yes" in binary questions, addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page https://vap-r.github.io
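
For readers unfamiliar with the objective the abstract refers to, the following is a minimal sketch of the standard DPO loss for a single preference pair, together with the reward-accuracy metric commonly reported alongside it. It is not the authors' training code. It assumes that summed sequence log-probabilities under the policy and the frozen reference model are already available; the function names, argument names, and the default `beta=0.1` are illustrative choices, not values taken from the paper.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (loss, margin) for one preference pair.

    All four inputs are summed log-probabilities of a full response.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin

def reward_accuracy(margins):
    """Fraction of pairs whose chosen response has the higher implicit reward."""
    return sum(m > 0 for m in margins) / len(margins)
```

When the rejected response differs from the chosen one only in a few task-relevant spans, the margin cannot be enlarged by exploiting length or style differences, which is the motivation the abstract gives for hard negatives.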

Paper Structure

This paper contains 70 sections, 4 equations, 27 figures, 21 tables.

Figures (27)

  • Figure 1: Examples from the VaPR hard-negative generation framework, each showing the instruction, image, chosen response, and three rejected variants: (a) fine-grained perception and counting, and (b) spatial reasoning. VaPR introduces targeted errors by modifying only task-relevant spans, whereas length-biased rejections add verbose description and style-biased rejections alter both style and content. Blue highlights the relevant spans in the chosen response, green shows VaPR perturbations, and red indicates stylistic or length-biased edits.
  • Figure 2: VaPR: A three-stage pipeline that generates 30K hard-negative preference pairs from the LLaVA-v1.5-665K SFT dataset. Stage 1: Filter out irrelevant samples (e.g., MCQs). Stage 2: Categorize remaining samples based on task. Stage 3: Use task-specific prompts (with optional penalty lists) to produce stylistically and length-wise similar but content-distinct negative responses.
  • Figure 3: VaPR task distribution
  • Figure 4: Comparison of preference datasets. (a) Average reference model log-probabilities for chosen vs. rejected responses across VaPR, SIMA, and POVID - lower values indicate lower reference likelihood. (b) Reward accuracy trends over training steps show that SIMA improves gradually while POVID saturates quickly.
  • Figure 5: Performance scaling of VaPR models with 3K, 10K, and 30K samples, shown as % improvement over base instruct models. Note: X-axis spacing between 3K, 10K, and 30K is not uniform.
  • ...and 22 more figures
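
The three-stage pipeline described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the MCQ heuristic, the keyword-based task categories, the prompt strings, the `max_len_ratio` check, and the `editor` callable (a stand-in for the LLM editor) are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical task-specific editing prompts; the real prompts are not given here.
PROMPTS = {
    "counting": "Change only the stated count so the answer becomes incorrect.",
    "spatial": "Change only the spatial relation so the answer becomes incorrect.",
    "other": "Introduce one targeted factual error; keep style and length.",
}

def is_relevant(sample: dict) -> bool:
    """Stage 1 (hypothetical heuristic): drop multiple-choice questions."""
    text = sample["instruction"]
    return not ("A." in text and "B." in text)

def categorize(sample: dict) -> str:
    """Stage 2 (hypothetical keyword rules): assign a task category."""
    text = sample["instruction"].lower()
    if "how many" in text:
        return "counting"
    if any(w in text for w in ("left", "right", "behind", "next to")):
        return "spatial"
    return "other"

def build_pair(sample: dict,
               editor: Callable[[str, str], str],
               max_len_ratio: float = 1.2) -> Optional[dict]:
    """Stage 3: edit the chosen response into a length-matched rejected one."""
    if not is_relevant(sample):
        return None
    task = categorize(sample)
    chosen = sample["response"]
    rejected = editor(PROMPTS[task], chosen)
    # Keep the pair only if the edit changed the content and the word counts stay close.
    ratio = len(rejected.split()) / max(len(chosen.split()), 1)
    if rejected == chosen or not (1 / max_len_ratio <= ratio <= max_len_ratio):
        return None
    return {"instruction": sample["instruction"], "task": task,
            "chosen": chosen, "rejected": rejected}
```

In practice `editor` would wrap a call to an LLM such as GPT-4o or an open-source model; passing it in as a callable keeps the sketch independent of any particular API.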