DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang

TL;DR

DIQ-H introduces the first benchmark for evaluating Vision-Language Models (VLMs) under dynamic temporal degradation, focusing on hallucination persistence, recovery, and temporal consistency in continuous video streams. The framework combines physics-based degradations, a multi-turn QA paradigm, and an Uncertainty-Guided Iterative Refinement (UIR) pipeline that generates reliable pseudo-ground-truth at scale. Experiments across 16 VLMs show substantial robustness gaps: GPT-4o is the strongest on recovery and temporal consistency yet reaches only a 78.5 percent recovery rate, while open-source models fall below 60 percent on temporal consistency. The work provides a practical platform for assessing and improving longitudinal multimodal reasoning in safety-critical applications.

Abstract

Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.

Paper Structure

This paper contains 25 sections, 13 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of motivation and approach. (a) VLMs hallucinate under degradation. (b) Existing benchmarks ignore temporal error propagation. (c) DIQ-H evaluates hallucination persistence and recovery under dynamic degradation.
  • Figure 2: Overview of the DIQ-H evaluation framework. The Multi-Agent Benchmark Generator (left) creates temporally degraded sequences through coordinated Degradation Simulator, Task Designer, and Difficulty Calibrator agents. The Tested VLM (center) processes these sequences, with performance metrics fed back for adaptive difficulty control. The UIR Module (right) generates reliable pseudo-ground truth annotations through uncertainty-guided filtering.
  • Figure 3: Illustration of temporal error propagation in VLMs. A transient degradation at frame $t=3$ causes the model to hallucinate a "blue truck" instead of the actual "red car." Even after visual quality is restored ($t \geq 4$), the hallucinated belief persists, demonstrating cognitive inertia. The DIQ-H benchmark specifically measures a model's ability to recover from such propagated errors.
  • Figure 4: Visualization of the three primary degradation types at varying severity levels. Each column shows the same scene under increasing degradation intensity. Motion blur (Row 2) introduces directional streaking from simulated camera motion. Sensor noise (Row 3) adds ISO-dependent Poisson-Gaussian artifacts. Compression (Row 4) produces blocking and ringing from aggressive H.265 encoding.
  • Figure 5: The Uncertainty-Guided Iterative Refinement (UIR) pipeline. Input images undergo $K$ perturbed inferences through the lightweight VLM. Jensen-Shannon divergence and Hodges-Lehmann estimation quantify output uncertainty. Responses below threshold $\tau$ are accepted as pseudo-GT; uncertain outputs trigger adaptive dropout refinement until convergence.
  • ...and 2 more figures
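
The Figure 5 caption names two ingredients of the UIR uncertainty estimate: Jensen-Shannon divergence and Hodges-Lehmann estimation, followed by a threshold $\tau$. The sketch below shows one plausible way these pieces could fit together; it is not the authors' implementation. It assumes each of the $K$ perturbed inferences yields a discrete answer distribution, that uncertainty is the Hodges-Lehmann estimate of the pairwise divergences, and that the function names and the default `tau=0.1` are purely illustrative.

```python
import math
from itertools import combinations, combinations_with_replacement
from statistics import median


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions given as {answer: probability} dicts."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def hodges_lehmann(xs):
    """One-sample Hodges-Lehmann estimator: the median of all
    Walsh averages (x_i + x_j) / 2 with i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(xs, 2)]
    return median(walsh)


def uncertainty_score(dists):
    """ASSUMPTION: uncertainty is the Hodges-Lehmann estimate of the
    pairwise JS divergences among the K perturbed outputs."""
    if len(dists) < 2:
        raise ValueError("need at least two perturbed inferences")
    pairwise = [js_divergence(p, q) for p, q in combinations(dists, 2)]
    return hodges_lehmann(pairwise)


def accept_as_pseudo_gt(dists, tau=0.1):
    """Accept when uncertainty falls below tau (tau=0.1 is a
    hypothetical value, not taken from the paper)."""
    return uncertainty_score(dists) < tau


# K = 3 perturbed inferences that agree: low uncertainty, accepted.
stable = [{"red car": 0.9, "blue truck": 0.1}] * 3
# One inference flips to "blue truck": high uncertainty, rejected.
unstable = [{"red car": 1.0}, {"blue truck": 1.0}, {"red car": 1.0}]
```

With these inputs, `accept_as_pseudo_gt(stable)` holds and `accept_as_pseudo_gt(unstable)` does not; in the pipeline described by the caption, the rejected case would be routed to adaptive dropout refinement, which this sketch does not model. The Hodges-Lehmann estimator is used here as a robust location estimate, so a single outlying pairwise divergence shifts the score less than it would shift a mean.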