Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Md Ashikur Rahman; Md Arifur Rahman; Niamul Hassan Samin; Abdullah Ibne Hanif Arean; Juena Ahmed Noshin

TL;DR

Long-horizon vision-language models that maintain temporally grounded beliefs generalize better. The paper formalizes this as behavioral faithfulness, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state.

Abstract

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).
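The abstract reports a Pearson correlation between SGR and out-of-distribution retention together with a permutation-test $p$-value. As a minimal sketch of how such a test can be run, the Python below shuffles one variable to build a null distribution for $|r|$. The function names, the two-sided form of the test, the number of permutations, and the input values are all assumptions made for illustration; the numbers are toy placeholders, not the paper's measurements, and the authors' exact procedure is not given in this excerpt.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the correlation of xs and ys.

    Shuffling ys breaks any pairing with xs, so the shuffled correlations
    approximate the null distribution of |r| under "no association".
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical per-model values (eight models), for illustration only.
    toy_sgr = [0.52, 0.58, 0.61, 0.66, 0.70, 0.73, 0.79, 0.84]
    toy_ood_retention = [0.60, 0.57, 0.68, 0.66, 0.75, 0.71, 0.80, 0.86]
    r = pearson_r(toy_sgr, toy_ood_retention)
    p = permutation_p_value(toy_sgr, toy_ood_retention)
    print(f"r = {r:.2f}, permutation p = {p:.4f}")
```

With only eight models, a permutation test is a reasonable choice because it makes no normality assumption about the per-model scores.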

Paper Structure (27 sections, 4 equations, 3 figures, 3 tables)


Introduction
Related Work
Operationalizing Behavioral Faithfulness
Reasoning Extraction
Visual Grounding Verification Pipeline
Verification for Embodied Tasks
Handling Unverifiable Steps
Visual Belief Tracking
Faithfulness in Reasoning
Controlled Visual Perturbations
Experimental Setup
Datasets and Benchmarks
Models Evaluated
Implementation Details
Baselines
...and 12 more sections

Figures (3)

Figure 1: Four-stage operationalization of behavioral faithfulness. (1) Extract reasoning traces; (2) verify grounding against visual evidence; (3) track belief consistency; (4) apply controlled perturbations. Outputs: SGR, TCS, HR, VRS.
Figure 2: Results overview. Left: Models with higher SGR rely more selectively on visual evidence (VRS). Right: Temporal grounding degrades over task progress, yet in-distribution SGR strongly predicts OOD retention, a relationship that holds within the capacity-matched 7B cluster.
Figure 3: High accuracy with imperfect grounding. The model correctly answers "Red", but Step 2 is unsupported (the person is walking, not crossing). Standard evaluation misses this failure; our framework reveals SGR $= 0.67$, HR $= 0.33$.
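The numbers in the Figure 3 caption are consistent with a three-step trace in which two steps are grounded and one is unsupported. The sketch below shows that arithmetic under stated assumptions: that SGR is the share of verifiable steps supported by visual evidence, that HR is the share of verifiable steps that are unsupported, and that the trace has three steps. This excerpt does not give the formal definitions, so the label names, the function `grounding_rates`, and the choice to exclude unverifiable steps from the denominator are hypothetical.

```python
from collections import Counter


def grounding_rates(step_labels):
    """Return (SGR, HR) for one reasoning trace.

    Assumed labels: "grounded", "unsupported", "unverifiable".
    Unverifiable steps are left out of the denominator (an assumption).
    """
    counts = Counter(step_labels)
    verifiable = counts["grounded"] + counts["unsupported"]
    if verifiable == 0:
        return None, None  # nothing to score in this trace
    sgr = counts["grounded"] / verifiable
    hr = counts["unsupported"] / verifiable
    return sgr, hr


# Assumed three-step trace matching the caption: Step 2 is unsupported.
labels = ["grounded", "unsupported", "grounded"]
sgr, hr = grounding_rates(labels)
print(f"SGR={sgr:.2f}, HR={hr:.2f}")  # prints "SGR=0.67, HR=0.33"
```

The final answer can still be correct in such a trace, which is why accuracy alone would not surface the unsupported step.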
