Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

Peng Sun; Huawen Shen; Yi Ban; Tianfan Fu; Yanbo Wang; Yuqiang Li

TL;DR

CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise.

Abstract

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS, respectively.
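The core idea in the abstract, comparing the evaluator's verdict on an answer with and without the question, can be illustrated with a small sketch. This is a hypothetical illustration only: the exact CVS formulas are not given in this summary, so the log-ratio form of the shifts, the filtering rule, the low-score preference, and the names `Verdict`, `verdict_shifts`, and `select_samples` are all assumptions. The probabilities are assumed to come from a frozen VLLM asked whether the answer is valid for the image, once with the question in the prompt and once without it.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Verdict:
    """Hypothetical evaluator output: probability mass on 'valid' vs 'invalid'."""
    p_yes: float
    p_no: float


def verdict_shifts(with_q: Verdict, without_q: Verdict) -> Tuple[float, float]:
    """Return (affirmation shift, rejection shift) as log-ratios.

    Assumption: a shift is the change in log-probability of a verdict
    when the question is added to the prompt.
    """
    eps = 1e-12  # guards against log(0)
    affirm = math.log(with_q.p_yes + eps) - math.log(without_q.p_yes + eps)
    reject = math.log(with_q.p_no + eps) - math.log(without_q.p_no + eps)
    return affirm, reject


def select_samples(samples: List[Tuple[str, Verdict, Verdict]], budget: int) -> List[str]:
    """Keep samples whose answer becomes more valid once the question is shown.

    Assumption: samples where the question lowers affirmation or raises
    rejection are treated as semantic-conflict noise and dropped; the
    remainder are ranked by the smallest positive affirmation shift,
    as a stand-in for preferring hard positives near the decision boundary.
    """
    kept = []
    for sample_id, with_q, without_q in samples:
        affirm, reject = verdict_shifts(with_q, without_q)
        if affirm <= 0 or reject >= 0:
            continue  # question does not help the answer: treat as noise
        kept.append((affirm, sample_id))
    kept.sort()  # lowest positive shift first
    return [sample_id for _, sample_id in kept[:budget]]


# Toy probabilities, invented purely for illustration.
toy = [
    ("a", Verdict(0.9, 0.1), Verdict(0.3, 0.7)),  # question strongly helps
    ("b", Verdict(0.2, 0.8), Verdict(0.6, 0.4)),  # question hurts: dropped
    ("c", Verdict(0.6, 0.4), Verdict(0.5, 0.5)),  # question helps slightly
]
selected = select_samples(toy, budget=2)
```

With these toy values, sample "b" is discarded and "c" ranks ahead of "a" because its positive shift is smaller. In practice the two verdict probabilities would be read from the evaluator's output distribution over the affirmative and negative tokens.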

Paper Structure (39 sections, 11 equations, 9 figures, 4 tables)

Introduction
Related Work
Score-based Methods
Clustering-based Methods
Method
Preliminaries
VLLM.
A Distributional Perspective on Data Selection.
Problem Formulation
Conditional Verdict Shift (CVS)
Conditional Affirmation Shift
Conditional Rejection Shift
Filtering Protocol
Preference for Hard Positive Samples
Experiments
...and 24 more sections

Figures (9)

Figure 1: (a) Prompt construction for CVS. (b) CVS pipeline: a frozen VLLM evaluates conditional affirmation and rejection shifts to filter noisy samples and prioritize hard positives near the decision boundary.
Figure 2: Performance comparison between CVS and baseline methods on Vision-Flan. CVS achieves the best performance at 10% and 15% sampling ratios, even surpassing full-data training.
Figure 3: Performance comparison between CVS and baseline methods on The Cauldron. CVS shows robust performance across different sampling ratios.
Figure 4: Effect of CVS score ranges on model performance. The Low strategy consistently yields the largest gains across all budgets.
Figure 5: Robustness of CVS to evaluator architecture and scale.
...and 4 more figures
