Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin

TL;DR

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Temporal grounding is empirically measurable: it quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state.

Abstract

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).
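
The headline statistic pairs a Pearson correlation with a permutation test. A minimal Python sketch of such a test follows, assuming a two-sided exact test that enumerates every ordering of the OOD retention scores; the abstract does not say which variant the authors ran, and the function names `pearson_r` and `exact_permutation_p` are illustrative, not taken from the paper. With eight models, full enumeration needs only $8! = 40{,}320$ permutations.

```python
import itertools
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exact_permutation_p(xs, ys):
    """Two-sided exact permutation p-value for the Pearson correlation.

    Assumption: the null hypothesis is that the pairing between xs
    (e.g. per-model SGR) and ys (e.g. per-model OOD retention) is arbitrary,
    so every reordering of ys is equally likely.
    """
    observed = abs(pearson_r(xs, ys))
    extreme = 0
    total = 0
    for shuffled in itertools.permutations(ys):
        total += 1
        # Small tolerance so the identity permutation always counts.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

Calling `exact_permutation_p(sgr_scores, ood_retention)` on two equal-length lists of per-model scores returns the fraction of permutations whose absolute correlation is at least as large as the observed one. The smallest achievable value is bounded by the number of permutations, which is why a small model pool still permits a p-value in the range the abstract reports.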

Paper Structure (27 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Four-stage operationalization of behavioral faithfulness. (1) Extract reasoning traces; (2) verify grounding against visual evidence; (3) track belief consistency; (4) apply controlled perturbations. Outputs: SGR, TCS, HR, VRS.
  • Figure 2: Results overview. Left: Models with higher SGR rely more selectively on visual evidence (VRS). Right: Temporal grounding degrades over task progress, yet in-distribution SGR strongly predicts OOD retention, a relationship that holds within the capacity-matched 7B cluster.
  • Figure 3: High accuracy with imperfect grounding. The model correctly answers "Red" but Step 2 is unsupported (person walks, not crosses). Standard evaluation misses this failure; our framework reveals SGR $= 0.67$, HR $= 0.33$.
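
The Figure 3 numbers can be reproduced with a short Python sketch under stated assumptions: the trace has three reasoning steps, SGR is the fraction of steps a verifier marks as supported by the visual evidence, and HR is the fraction marked unsupported. The captions do not give the formal definitions or the step count, so the function `step_metrics` and its input format are hypothetical.

```python
def step_metrics(step_supported):
    """Return (SGR, HR) for one reasoning trace.

    step_supported: list of booleans, one per reasoning step, True when a
    verifier judged the step to be supported by the visual evidence.
    Assumption: SGR = supported / total and HR = unsupported / total.
    """
    total = len(step_supported)
    if total == 0:
        raise ValueError("trace must contain at least one step")
    supported = sum(1 for ok in step_supported if ok)
    sgr = supported / total
    hr = (total - supported) / total
    return sgr, hr


# Assumed three-step trace in which only Step 2 is unsupported.
sgr, hr = step_metrics([True, False, True])
print(f"SGR={sgr:.2f}, HR={hr:.2f}")  # prints "SGR=0.67, HR=0.33"
```

Final-answer accuracy for this trace would still be 1.0, which is the gap the figure illustrates: the per-step labels carry information the answer alone does not.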