VisualDeltas: Learning Preferences from Visual Quality Perturbations

Hailiang Huang, Yihao Liu, Shengyue Guan, Haoze Li, Sujian Li

TL;DR

VisualDeltas is a lightweight preference-learning framework that extracts supervision from visual quality variations in multimodal data. It consistently outperforms rejection-sampling fine-tuning, improves generalization, and extends naturally to a range of visual degradations.

Abstract

We present VisualDeltas, a lightweight preference-learning framework that extracts supervision from visual quality variations in multimodal data. By leveraging the systematic impact of image quality on visual perception and reasoning, VisualDeltas induces informative preference signals without relying on human annotations or external teachers. The framework supports both label-free and label-based regimes, so available supervision can be used when present. Across diverse multimodal benchmarks and model scales, VisualDeltas consistently outperforms rejection-sampling fine-tuning, improves generalization, and extends naturally to a range of visual degradations.

Paper Structure (76 sections, 5 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: DPO pair construction via multimodal input quality perturbation. HQ and LQ inputs for the same multimodal QA task induce correct and incorrect reasoning, forming a natural preference pair.
  • Figure 2: Overview of VisualDeltas. For each user query, the model generates paired responses under both HQ and LQ image views. We compare with SFT (supervised fine-tuning on HQ-correct responses only) and two VisualDeltas variants: VD-LF (label-free) uses all HQ vs. LQ pairs without correctness filtering, while VD-LB (label-based) selects only HQ-correct vs. LQ-wrong pairs. Both VisualDeltas variants apply DPO exclusively with HQ context during training.
  • Figure 3: Sample category distribution on HiTab. Quality-Sensitive samples (HQ correct, LQ wrong) comprise 38.3% of the dataset, providing clean HQ$\succ$LQ preference pairs for VD-LB training.
  • Figure 4: Response length distribution by category. LQ responses are consistently longer than HQ responses, suggesting that degraded visual inputs trigger compensatory but ineffective reasoning. After DPO training on VisualDeltas preference pairs, the HQ distribution shifts left and becomes sharper, with reduced mean and median token lengths. This demonstrates that DPO improves reasoning efficiency: models learn to produce more concise responses while maintaining higher accuracy.
  • Figure 5: Visual Examples from Different Datasets. (a)-(b) Table datasets (HiTab, WikiTableQuestions) feature complex, dense structures requiring precise visual grounding. (c)-(d) Natural image datasets (VQA, GQA) contain real-world scenes with varying complexity. (e) MathVision contains simple mathematical expressions and diagrams that rely more on reasoning than visual details. This visual complexity hierarchy explains why HiTab/WTQ show strong resolution dependence while MathVision remains resolution-independent.
  • ...and 3 more figures
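
The pair construction described in the captions for Figures 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` record, its field names, and the `build_pairs` helper are hypothetical, and the output keys (`prompt`, `chosen`, `rejected`) are only one common way to lay out DPO training data. The sketch assumes HQ and LQ responses have already been generated for each query. In both variants the HQ response is the preferred one and the pair is attached to the HQ image, following the caption's note that DPO is applied with HQ context only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One query with responses generated under both image views (hypothetical schema)."""
    question: str
    hq_image: str                      # path or ID of the high-quality image
    hq_response: str                   # response generated from the HQ view
    lq_response: str                   # response generated from the LQ view
    hq_correct: Optional[bool] = None  # None when no labels are available
    lq_correct: Optional[bool] = None


def build_pairs(samples: List[Sample], label_based: bool) -> List[Dict[str, str]]:
    """Build HQ-preferred-over-LQ preference pairs.

    label_based=False (VD-LF): keep every HQ vs. LQ pair, no correctness filter.
    label_based=True  (VD-LB): keep only HQ-correct vs. LQ-wrong pairs.
    """
    pairs = []
    for s in samples:
        if label_based and not (s.hq_correct is True and s.lq_correct is False):
            continue  # drop anything that is not a quality-sensitive sample
        pairs.append({
            "image": s.hq_image,        # training context uses the HQ view only
            "prompt": s.question,
            "chosen": s.hq_response,    # preferred: response under HQ input
            "rejected": s.lq_response,  # dispreferred: response under LQ input
        })
    return pairs
```

With labels available, the label-based filter keeps only the samples that Figure 3 calls Quality-Sensitive (HQ correct, LQ wrong). The label-free variant needs no correctness information at all, which is why the correctness fields above default to `None`.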