Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee; Sangheum Hwang

TL;DR

This work tackles language bias in large vision-language models by introducing Visual Information Gain (VIG), a perplexity-based metric that quantifies how much visual input reduces model uncertainty. VIG enables fine-grained analysis at both the sample and token levels and drives a VIG-guided selective training regime that prioritizes visually informative data. Empirically, VIG-guided training improves visual grounding and reduces language bias across vision-understanding and hallucination benchmarks while using substantially less supervision, and it complements existing visual grounding methods without architectural changes. The approach is particularly effective for smaller models, and the trained models show increased attention to visual tokens and robustness to textual corruption. As a practical consideration, VIG scores, once computed, can be reused across training runs.

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, these approaches typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
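The abstract describes VIG only in words, so the following is a minimal sketch rather than the paper's definition. It assumes that per-token cross-entropy losses (in nats) are already available for the same answer tokens with and without the image, that token-level gain is the loss without the image minus the loss with it, and that sample-level gain is the corresponding drop in log-perplexity. The function names and the `fraction` parameter are hypothetical, chosen for illustration.

```python
import math
from typing import List, Sequence


def token_vig(loss_without_image: Sequence[float],
              loss_with_image: Sequence[float]) -> List[float]:
    """Per-token gain: loss without the image minus loss with it (assumed form)."""
    if len(loss_without_image) != len(loss_with_image):
        raise ValueError("both loss sequences must cover the same tokens")
    return [wo - w for wo, w in zip(loss_without_image, loss_with_image)]


def perplexity(token_losses: Sequence[float]) -> float:
    """Perplexity as the exponential of the mean per-token loss."""
    return math.exp(sum(token_losses) / len(token_losses))


def sample_vig(loss_without_image: Sequence[float],
               loss_with_image: Sequence[float]) -> float:
    """Sample-level gain as a log-perplexity reduction (assumed form).

    This equals the mean of the per-token gains returned by token_vig.
    """
    return math.log(perplexity(loss_without_image) / perplexity(loss_with_image))


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring `fraction` of items, in original order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy losses for one answer with three tokens (made-up numbers).
without_img = [2.0, 0.5, 3.0]
with_img = [0.5, 0.5, 1.0]
gains = token_vig(without_img, with_img)          # the middle token gains nothing
keep_tokens = select_top_fraction(gains, 2 / 3)   # keep the two most visual tokens
```

Under these assumptions, a token whose loss is unchanged by the image gets a gain of zero and would be dropped first when only high-gain tokens are kept. The same `select_top_fraction` helper can be applied to sample-level scores to keep only the most visually informative samples.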

Paper Structure (29 sections, 9 equations, 9 figures, 12 tables)

Introduction
Related Work
Visual Information Gain
Preliminary
Definition of VIG
Analysis
VIG-Guided Selective Training
Experiment
Tasks and Benchmarks
Overall Performance and Data Efficiency
Comparison with Existing Methods
Analysis
Ablation Study
Conclusion
Details of Benchmarks
...and 14 more sections

Figures (9)

Figure 1: Examples of LLaVA-1.5 instruction tuning data. The dataset includes both samples and tokens with very different levels of visual dependency: some questions can be answered without looking at the image, whereas others need fine-grained visual details (highlighted in green).
Figure 2: VIG distribution across benchmarks. Blue benchmarks (COCO, POPE) show stronger multimodal interaction, while red benchmarks (GQA, SQA) exhibit weaker visual dependency.
Figure 3: Visualizing the token-level VIGs. Each point shows a token's prediction loss with ($x$-axis) and without ($y$-axis) visual input. The color encodes the token-level loss difference ($y-x$).
Figure 4: Attention fraction allocated to visual tokens. Compared to LLaVA-1.5 7B, VIG training assigns significantly more attention to visual tokens across all layers.
Figure 5: Evaluation of text reliance under textual corruption. Base: accuracy on clean inputs. Corruption: accuracy when the same image is paired with a corrupted caption containing a conflicting description. Norm: corruption accuracy normalized by the corresponding Base (Corruption/Base).
...and 4 more figures
