Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

Chun-Peng Chang; Chen-Yu Wang; Holger Caesar; Alain Pagani

TL;DR

This work investigates whether Vision-Language Models, when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning.

Abstract

A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting. The first is response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing. The second is limited temporal reasoning, in which models fail to reason about and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
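
The response inconsistency described in the abstract can be illustrated with a small agreement score under option shuffling. The following is a minimal sketch and not the paper's evaluation code: `shuffle_consistency`, `answer_fn`, `content_model`, and `position_model` are hypothetical names, the two toy models stand in for a real VLM call, and scoring by majority agreement is an assumption about how such consistency might be quantified.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def shuffle_consistency(
    answer_fn: Callable[[str, Sequence[str]], int],
    question: str,
    options: Sequence[str],
    trials: int = 8,
    seed: int = 0,
) -> float:
    """Return the fraction of shuffled trials that agree with the majority answer.

    answer_fn is a hypothetical stand-in for a VLM: it receives the question and
    the (shuffled) options and returns the index of the option it selects.
    """
    rng = random.Random(seed)
    picked_texts = []
    for _ in range(trials):
        shuffled = list(options)
        rng.shuffle(shuffled)  # perturb only the order of the options
        picked_index = answer_fn(question, shuffled)
        # Record the option text, so that agreement is order-independent.
        picked_texts.append(shuffled[picked_index])
    majority_count = Counter(picked_texts).most_common(1)[0][1]
    return majority_count / trials


def content_model(question: str, options: Sequence[str]) -> int:
    """Toy model that answers by content: always selects the same option text."""
    return list(options).index("pedestrian crossing")


def position_model(question: str, options: Sequence[str]) -> int:
    """Toy model with a position bias: always selects the first listed option."""
    return 0


if __name__ == "__main__":
    opts = [
        "pedestrian crossing",
        "vehicle turning left",
        "traffic light turns red",
        "road is clear",
    ]
    q = "What is most likely to happen in the next 3 seconds?"
    print("content-based:", shuffle_consistency(content_model, q, opts))
    print("position-biased:", shuffle_consistency(position_model, q, opts))
```

A model that answers by content scores 1.0 regardless of option order, while a position-biased model tends toward the score expected from random guessing as the number of trials grows. Comparing the selected option text rather than the selected index is what makes the score insensitive to the shuffle itself.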

Paper Structure (30 sections, 6 equations, 16 figures, 7 tables, 3 algorithms)

Introduction
Related Work
Problem Formulation and Evaluation
Response Unreliability and Inconsistency
Contradiction and Temporal Misalignment
Formalization.
Evaluation and Metrics
Self-Aligned Future Description.
LLM-as-Judge Evaluation.
FutureVQA Benchmark.
FutureAgent: An Approach for Enhanced Temporal Reasoning
Experiment and Analysis
Evaluation Setup and Implementation Details
Consistency and Reliability of VLMs Response
Prompt-perturbation sensitivity vs. random guessing.
...and 15 more sections

Figures (16)

Figure 1: Reliability failures in VLMs. The figure illustrates three issues: (i) response inconsistency—identical or very similar prompts yield different answers; (ii) contradiction—correct local interpretation but inconsistent future description; and (iii) temporal misalignment—events predicted at incoherent times despite accurate per-frame cues.
Figure 2: Overview of our framework for evaluating reliable temporal reasoning in VLM driving assistants. Left: The agent consumes past frames $V_t$ and a prompt to generate temporally aligned predictions over a variable future horizon. Right (FutureVQA): Benchmark construction combines human and AI contributions: human experts create natural Q/A pairs, while AI performs quality control to ensure answerability and consistency. Bottom (Evaluation): To thoroughly analyze model reliability, we adopt a self-aligned future description setup, where a model’s predicted description is compared to a reference response generated by the same model when the actual future frames are provided. An AI checker is further applied to validate that predictions remain coherent and meaningful. Beyond this, we evaluate consistency under repeated queries and option shuffling, and analyze temporal performance decay to quantify how model reliability changes as the prediction horizon increases.
Figure 3: Example of the FutureVQA task. The VLM is asked to answer questions about future scenes based on predictions, without access to the corresponding future frames.
Figure 4: Proposed self-supervised approach to align temporal events and minimize incorrect or contradictory reasoning. Given a video sequence $V$, we generate detailed descriptions using a pretrained VLM $\psi$ as pseudo reference labels $a^{\text{ref}}_{t + \Delta t}$. We then fine-tune the model $\psi^*$, initialized from $\psi$, using only past frames as input and training it to predict descriptions of unseen future frames $a^{\text{pred}}_{t + \Delta t}$. A weighting function $\lambda(\Delta t)$ adjusts the contribution of each loss term based on the temporal distance $\Delta t$.
Figure 5: Temporal performance decay analysis on the FutureVQA dataset. (a) Accuracy decay across horizons, where solid lines denote four trials and shaded regions indicate fewer trials (1–3). (b) Relationship between regular VQA performance (y-axis) and relative long-horizon preservation (x-axis: Acc@12 divided by regular VQA accuracy). (c) Relationship between regular VQA performance (y-axis) and relative mean preservation (x-axis: $\text{mAcc}_{(1 \to 12s)}$ divided by regular VQA accuracy). Together, these plots show how well models retain their performance when extending from immediate perception to future prediction.
...and 11 more figures
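
The two preservation ratios named in the Figure 5 caption are simple quotients, and a short sketch may make them concrete. This is an illustration under stated assumptions, not the authors' code: `preservation_ratios` is a hypothetical helper, the mean accuracy over horizons is assumed to be an unweighted arithmetic mean over whichever horizons are supplied, and the numbers in the usage example are made up for illustration rather than taken from the paper.

```python
from statistics import mean
from typing import Mapping, Tuple


def preservation_ratios(
    regular_acc: float,
    horizon_acc: Mapping[int, float],
    long_horizon: int = 12,
) -> Tuple[float, float]:
    """Compute (long-horizon preservation, mean preservation).

    regular_acc: accuracy on regular VQA, where the queried frame is visible.
    horizon_acc: accuracy at each future horizon in seconds, e.g. {1: ..., 12: ...}.
    long_horizon: horizon used for the long-horizon ratio (12 s in Figure 5).
    """
    if regular_acc <= 0:
        raise ValueError("regular_acc must be positive")
    if long_horizon not in horizon_acc:
        raise KeyError(f"no accuracy recorded for horizon {long_horizon}s")
    # Acc@12 divided by regular VQA accuracy (Figure 5b, x-axis).
    long_ratio = horizon_acc[long_horizon] / regular_acc
    # Mean accuracy over the supplied horizons divided by regular VQA accuracy
    # (Figure 5c, x-axis). The unweighted mean is an assumption.
    mean_ratio = mean(horizon_acc.values()) / regular_acc
    return long_ratio, mean_ratio


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    example = {1: 0.6, 6: 0.5, 12: 0.4}
    long_ratio, mean_ratio = preservation_ratios(0.8, example)
    print(f"long-horizon preservation: {long_ratio:.3f}")
    print(f"mean preservation: {mean_ratio:.3f}")
```

Normalizing by regular VQA accuracy separates how much a model degrades with horizon from how strong its perception is to begin with, which is the comparison the caption describes.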
