SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, Min Yang

TL;DR

SmartSight tackles perceptual hallucinations in Video-LLMs without sacrificing video understanding by enabling self-reflection through multi-sample generation. It introduces the Temporal Attention Collapse score to quantify hallucination via frame- and segment-level attention distributions and uses a Visual Attention Vanishing Point to enable early, efficient stopping of low-quality candidates. The approach yields substantial hallucination reductions and improved reasoning across multiple open-source Video-LLMs, with strong scalability and efficiency advantages at test time. This training-free method offers a practical, model-agnostic path toward more reliable video-language reasoning in real-world settings.

Abstract

Although Video Large Language Models (Video-LLMs) have advanced rapidly in recent years, perceptual hallucinations pose a substantial safety risk that severely restricts their real-world applicability. Several hallucination-mitigation methods have been proposed, but they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucination outputs that standard greedy decoding often obscures. It assesses the hallucination level of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video while generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing Point, which enables more accurate hallucination estimation and early termination of hallucinated responses, substantially reducing decoding cost. Experiments show that SmartSight lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.
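
To make the selection idea concrete, the following is a minimal sketch of choosing among sampled candidates by how concentrated their attention over video frames is. It is not the paper's Temporal Attention Collapse score, whose definition is not given in this summary. The entropy-based proxy, the function names (`normalized_entropy`, `collapse_score`, `select_least_collapsed`), the `segment_size` parameter, and the candidate dictionary layout are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalized_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy of a non-negative weight vector, scaled to [0, 1]."""
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        # Degenerate input is treated as fully concentrated (an assumption).
        return 0.0
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def collapse_score(frame_attention: Sequence[float], segment_size: int = 4) -> float:
    """Hypothetical proxy: higher means attention is concentrated on few frames.

    Combines a frame-level and a segment-level term, where a segment is a
    run of `segment_size` consecutive frames. This is an illustrative
    stand-in, not the score defined in the paper.
    """
    segments = [
        sum(frame_attention[i:i + segment_size])
        for i in range(0, len(frame_attention), segment_size)
    ]
    frame_term = normalized_entropy(frame_attention)
    segment_term = normalized_entropy(segments)
    return 1.0 - 0.5 * (frame_term + segment_term)


def select_least_collapsed(candidates: List[dict]) -> dict:
    """Return the candidate whose frame attention is least concentrated."""
    return min(candidates, key=lambda c: collapse_score(c["frame_attention"]))
```

In this sketch each candidate is a dictionary with a `"text"` field and a `"frame_attention"` list holding one aggregated attention weight per video frame. How those weights are extracted from a particular Video-LLM is left open here.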

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison between SmartSight and existing methods in terms of hallucination suppression, video understanding, and reasoning. Lighter colors indicate lower computational cost. Existing methods suffer from two main limitations: (1) poor transferability across models, showing effectiveness only on LLaVA-NEXT-Video; and (2) reducing hallucinations at the cost of impaired video understanding. SmartSight simultaneously suppresses hallucinations and enhances video comprehension, achieving a more favorable balance between accuracy and efficiency.
  • Figure 2: Illustration of hallucination in Video-LLM outputs. Prediction 1 exhibits an over-reliance on the first frame, causing it to overlook crucial information in frame 12 and leading to an incorrect output. Prediction 2 concentrates excessively on visually similar segments (frames 7–8), which results in a misinterpretation of the video content.
  • Figure 3: Comparison between greedy decoding and sampling $N = 10$ responses per query. We visualize 100 randomly selected responses from VRIPT-HAL. The black curve shows the hallucination level of responses generated by greedy decoding. The blue and gray curves indicate the least and most hallucinated among the sampled responses. The shaded region illustrates that sampling can produce responses with lower hallucination.
  • Figure 4: Overview of the proposed SmartSight. Given an input video and a textual query, SmartSight generates $N$ responses in parallel. During generation, it dynamically detects the Visual Attention Vanishing Point and estimates hallucination severity using the proposed Temporal Attention Collapse Score. Only one high-quality candidate is retained for continued generation, thereby achieving a favorable balance between efficiency and effectiveness.
  • Figure 5: Results of applying SmartSight to Qwen2.5-VL-7B with different values of $N$. The figure shows that increasing $N$ enables test-time scaling, which is challenging to achieve with prior methods. When $N = 60$, the 7B model achieves performance comparable to the Qwen2.5-VL-32B model and the proprietary Gemini 1.5 Pro.
  • ...and 1 more figure
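
The pipeline described in the Figure 4 caption (generate candidates in parallel, detect a vanishing point, score each candidate, keep one) can be sketched roughly as below. This is an illustration under stated assumptions rather than the authors' procedure. The vanishing point is assumed here to be the first decoding step from which attention mass on video tokens stays under a threshold. The `threshold` and `patience` parameters, the `peak_share` score, and the candidate field names `visual_mass` and `frame_attention_steps` are all hypothetical.

```python
from typing import List, Optional, Sequence


def find_vanishing_point(
    visual_mass_per_step: Sequence[float],
    threshold: float = 0.05,
    patience: int = 2,
) -> Optional[int]:
    """First step of a run of `patience` consecutive low-attention steps.

    `visual_mass_per_step[t]` is assumed to be the share of attention that
    decoding step t places on video tokens. Returns None if no run is found.
    """
    run = 0
    for step, mass in enumerate(visual_mass_per_step):
        run = run + 1 if mass < threshold else 0
        if run == patience:
            return step - patience + 1
    return None


def peak_share(frame_attention: Sequence[float]) -> float:
    """Hypothetical collapse proxy: share of attention on the single top frame."""
    total = sum(frame_attention)
    return max(frame_attention) / total if total > 0 else 1.0


def score_prefix(candidate: dict) -> float:
    """Score a candidate using only the steps before its vanishing point."""
    cut = find_vanishing_point(candidate["visual_mass"])
    steps = candidate["frame_attention_steps"]
    prefix = steps if cut is None else steps[:cut]
    if not prefix:
        return 1.0  # no visually grounded steps: treat as fully collapsed
    summed = [sum(column) for column in zip(*prefix)]
    return peak_share(summed)


def keep_best(candidates: List[dict]) -> int:
    """Index of the single candidate to keep decoding; the rest are stopped."""
    return min(range(len(candidates)), key=lambda i: score_prefix(candidates[i]))
```

Scoring only the prefix before the vanishing point is what would allow the other candidates to be stopped early in a scheme like this, since later steps are assumed to carry little visual signal.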