Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

TL;DR

This work investigates whether the inverse scaling of test-time compute observed in language models (LMs) extends to vision-language models (VLMs) when visual distractors are present. It introduces Idis, a distractor-centered VQA dataset whose distractors vary along semantic, numerical, and spatial dimensions, and shows that visual distractors degrade accuracy without increasing reasoning length, unlike textual distractors in LMs. An attribute-trace analysis reveals that the model's reasoning becomes biased toward distractor-related attributes, and that distractor area and semantics strongly influence performance. The findings generalize to visual bias benchmarks such as Waterbirds, where reasoning VLMs amplify bias, while a simple prompt-based debiasing strategy that emphasizes foreground attributes improves robustness without retraining. Overall, the work provides a distractor-centric framework for interpreting and mitigating inference-time biases in multimodal reasoning systems.

Abstract

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.

Paper Structure

This paper contains 33 sections, 3 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Inverse scaling in reasoning LMs vs. VLMs. (a) In reasoning LMs, adding more textual distractors increases the reasoning length and decreases the accuracy, but the overall inverse scaling curve remains similar. (b) In reasoning VLMs, adding visual distractors decreases the accuracy but does not increase the reasoning length. Instead, the entire length-accuracy curve is shifted downward. (c) The strength of inverse scaling depends on the semantics of visual distractors (e.g., aligned, irrelevant, conflicting), with the accuracy drop being particularly severe when distractors are negatively spuriously correlated with the target object.
  • Figure 2: Larger distractor areas increase distractor-related attributes and lead to performance degradation. As the spatial scale of distractors relative to the target object grows (from smallest in (a) to largest in (c)), the proportion of distractor-related attributes within the reasoning trace increases, while the total number of attributes remains similar. This leads to a downward shift of the inverse scaling curve.
  • Figure 3: Dataset generation pipeline for Idis. We generate images with distractors by editing an image with an object-free background. In particular, we prompt Gemini 2.5 Flash Image with instructions to add distractor objects selected from the set of aligned, conflicting, or irrelevant distractors. Each set is defined by a set of keywords that are correlated with the target object class (aligned), correlated with other classes (conflicting), or not correlated with any class (irrelevant). This pipeline can generate multiple images with various choices of visual distractors while keeping the target object consistent. Yellow boxes indicate distractor regions (not included in actual images).
  • Figure 4: Inverse scaling in reasoning VLMs under various semantic relationships between visual distractors and the target object. We examine the inverse scaling trend across four reasoning VLMs, comparing the no-distractor baseline with the cases of inserting aligned, irrelevant, or conflicting distractors (four distractors each). While the no-distractor and aligned-distractor settings exhibit relatively stable performance or mild drops, irrelevant distractors induce steeper accuracy drops, and conflicting distractors cause the largest declines, with downward shifts of the curve. This reveals that longer reasoning chains amplify vulnerability to distractor interference, most notably under conflicting distractors.
  • Figure 5: Adding more distractors lowers accuracy without extending reasoning length. (a) The distractor count has no significant or consistent effect on the reasoning length across models. (b) In contrast, accuracy drops as we add more distractors, consistently across all models.
  • ...and 13 more figures
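
The attribute-trace idea referenced in the abstract and in Figure 2 (the share of distractor-related attributes within a reasoning trace) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attribute_counts`, the single-word keyword lists, and the plain token matching are all assumptions made for illustration, and the paper's actual attribute extraction may differ.

```python
import re
from typing import Dict, Iterable


def attribute_counts(
    trace: str,
    target_attrs: Iterable[str],
    distractor_attrs: Iterable[str],
) -> Dict[str, float]:
    """Count target- and distractor-related keyword mentions in a reasoning trace.

    Hypothetical helper: attributes are assumed to be single lowercase words,
    matched by exact token comparison (no stemming, no multi-word phrases).
    """
    # Lowercase the trace and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", trace.lower())
    target = {a.lower() for a in target_attrs}
    distractor = {a.lower() for a in distractor_attrs}

    n_target = sum(tok in target for tok in tokens)
    n_distractor = sum(tok in distractor for tok in tokens)
    total = n_target + n_distractor

    # Share of attribute mentions that refer to distractors (0.0 if none found).
    share = n_distractor / total if total else 0.0
    return {
        "target": n_target,
        "distractor": n_distractor,
        "distractor_share": share,
    }


# Illustrative trace and keyword lists (made up, not taken from Idis).
example = attribute_counts(
    "The bird has webbed feet and a long beak. "
    "There is a boat and water nearby, and another boat.",
    target_attrs=["feet", "beak"],
    distractor_attrs=["boat", "water"],
)
print(example)
```

Under these assumptions, a rising `distractor_share` with a roughly constant total count would correspond to the pattern described in Figure 2, where distractor-related attributes take up a larger fraction of the trace while the overall number of attributes stays similar.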