MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, Jinwoo Choi

TL;DR

MASH-VLM targets action-scene hallucination in Video-LLMs by disentangling spatial and temporal representations inside the LLM. It introduces DST-attention, which masks direct interactions between spatial and temporal tokens, and Harmonic-RoPE, which extends the dimensionality of the positional IDs so that spatial and temporal tokens keep balanced positions relative to the text tokens. The authors establish UNSCENE, a benchmark of 1,320 videos with 4,078 QA pairs for quantifying action-scene hallucination, and report state-of-the-art performance on UNSCENE and on existing video-understanding benchmarks. The approach reduces hallucinations in unusual-context and scene-only videos while maintaining strong generalization across tasks.

Abstract

In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.
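As a rough illustration of the masking idea behind DST-attention, the sketch below builds a boolean attention mask that blocks spatial-temporal pairs. It is a minimal sketch under stated assumptions, not the authors' implementation: the causal base mask, the token-type labels, and the function name `dst_attention_mask` are all hypothetical, and the paper may treat attention within each token group differently.

```python
from typing import List

# Hypothetical token-type labels used only for this illustration.
SPATIAL, TEMPORAL, TEXT = "spatial", "temporal", "text"


def dst_attention_mask(token_types: List[str]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k.

    Assumption: start from a standard causal mask (k <= q) and additionally
    block every pair made of one spatial and one temporal token, so the two
    groups never attend to each other directly.
    """
    n = len(token_types)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query
            if {token_types[q], token_types[k]} == {SPATIAL, TEMPORAL}:
                continue  # no direct spatial <-> temporal interaction
            mask[q][k] = True
    return mask


if __name__ == "__main__":
    types = [TEMPORAL, TEMPORAL, SPATIAL, SPATIAL, TEXT]
    for row in dst_attention_mask(types):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch the text token still attends to both visual groups, so spatial and temporal information can only be combined through the text tokens rather than by direct spatial-temporal attention.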

Paper Structure

This paper contains 46 sections, 7 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Action-scene hallucination. To evaluate action-scene hallucination in Video-LLMs, we introduce UNSCENE, an UNusual context & SCENE-only benchmark. (a) When given an unusual context video of people boxing in a library, existing Video-LLMs incorrectly predict the scene as a 'boxing ring'. When shown a scene-only video of a snow-covered mountain with no one present, these models incorrectly identify the action as 'a person skiing' or 'snowboarding on a mountain'. Existing models frequently hallucinate actions based on the scene context or incorrectly predict scenes based on the observed actions. (b) The proposed method, MASH-VLM, achieves state-of-the-art performance on the UNSCENE benchmark, as well as on existing video understanding benchmarks.
  • Figure 2: Comparison of attention mechanism and rotary positional embedding. (a) In LLaMA [llama2] and Vicuna [vicuna2023], standard causal attention among all tokens often entangles visual tokens. Additionally, in standard Rotary Position Embedding (RoPE) [su2024roformer], text tokens focus more on spatial tokens than on temporal tokens because the spatial tokens are closer to them in the sequence. Both factors contribute to action-scene hallucinations. (b) In contrast, our MASH-VLM employs DST-attention, where attention masking prevents direct interactions between spatial and temporal tokens, promoting feature disentanglement. We also introduce Harmonic-RoPE, which expands the dimensionality of standard RoPE positional IDs, allowing spatial and temporal tokens to maintain balanced positions relative to the text tokens. As a result, MASH-VLM effectively reduces action-scene hallucinations.
  • Figure 3: Overview of MASH-VLM. We propose MASH-VLM to mitigate action-scene hallucination. MASH-VLM employs Harmonic-RoPE and DST-attention within an LLM. Harmonic-RoPE assigns additional balanced positional IDs, ensuring that spatial and temporal tokens with the same ID maintain equal relative positional distances to a text token. DST-attention then disentangles these tokens by using masked attention, preventing direct interactions between spatial and temporal tokens. Together, these innovations in MASH-VLM effectively mitigate action-scene hallucinations and significantly enhance the model’s video understanding capabilities.
  • Figure 4: Harmonic-RoPE. To overcome the limitations of standard RoPE [su2024roformer], we propose Harmonic-RoPE. In its original form, standard RoPE does not assign equal positional IDs to spatial and temporal tokens. To address this, we expand the dimensionality of positional IDs, enabling spatial and temporal tokens to additionally receive balanced positional IDs relative to a text token. Specifically, we assign balanced positional IDs to the even dimensions and distinct positional IDs to the odd dimensions. By using Harmonic-RoPE, the model gains additional balanced positional information, leading to a more robust understanding of videos and ultimately mitigating action-scene hallucination.
  • Figure 5: UNSCENE Benchmark Generation Pipeline. In step 1, we collect 1,320 videos with unusual contexts and scene-only settings from YouTube. In step 2, we use GPT-4 [openai2024gpt4] to generate hallucination labels that existing Video-LLMs are likely to mispredict. In step 3, we generate binary question-answer pairs for action and scene hallucinations using both the hallucination labels and the ground truths. During evaluation, a model receives a score for a pair of dual questions only if it answers both questions correctly.
  • ...and 6 more figures
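The positional-ID idea in the Figure 4 caption (balanced IDs on even dimensions, distinct IDs on odd dimensions) can be sketched as follows. This is one plausible reading rather than the paper's exact scheme. The assumptions here are: all visual tokens precede the text tokens, the i-th spatial and i-th temporal token share balanced ID i, text tokens continue from the larger visual count, and "dimension" is indexed by a plain integer. The names `harmonic_position_ids` and `position_id_for_dim` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical token-type labels used only for this illustration.
SPATIAL, TEMPORAL, TEXT = "spatial", "temporal", "text"


def harmonic_position_ids(token_types: List[str]) -> Tuple[List[int], List[int]]:
    """Return (distinct_ids, balanced_ids) for a token sequence.

    distinct_ids: ordinary sequential positions, as in standard RoPE.
    balanced_ids: assumed scheme where the i-th spatial and the i-th temporal
    token share ID i, so both sit at the same distance from a text token.
    """
    distinct = list(range(len(token_types)))
    balanced: List[int] = []
    counters = {SPATIAL: 0, TEMPORAL: 0}
    next_text_id = None  # set when the first text token is reached
    for token_type in token_types:
        if token_type in counters:
            if next_text_id is not None:
                raise ValueError("sketch assumes visual tokens precede text")
            balanced.append(counters[token_type])
            counters[token_type] += 1
        else:
            if next_text_id is None:
                next_text_id = max(counters.values())
            balanced.append(next_text_id)
            next_text_id += 1
    return distinct, balanced


def position_id_for_dim(dim: int, distinct_id: int, balanced_id: int) -> int:
    """Even dimensions use the balanced ID, odd dimensions the distinct ID."""
    return balanced_id if dim % 2 == 0 else distinct_id


if __name__ == "__main__":
    types = [TEMPORAL, TEMPORAL, SPATIAL, SPATIAL, TEXT, TEXT]
    distinct, balanced = harmonic_position_ids(types)
    print(distinct)  # sequential positions
    print(balanced)  # spatial and temporal tokens share IDs
```

With the example sequence, the first text token is 4 and 2 positions away from the first temporal and first spatial token under the distinct IDs, but 2 positions away from both under the balanced IDs, which is the balance the caption describes.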
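The evaluation rule stated in the Figure 5 caption, where a model scores only when both questions of a dual pair are answered correctly, can be sketched as a simple paired accuracy. The function name `paired_accuracy` and the input layout (one tuple of two booleans per question pair) are assumptions made for illustration, and the benchmark's actual scoring script may differ.

```python
from typing import List, Tuple


def paired_accuracy(results: List[Tuple[bool, bool]]) -> float:
    """Fraction of dual-question pairs where BOTH answers are correct.

    Each item is (first_question_correct, second_question_correct).
    A pair contributes to the score only if both entries are True.
    """
    if not results:
        raise ValueError("results must contain at least one question pair")
    both_correct = sum(1 for first, second in results if first and second)
    return both_correct / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for four question pairs.
    outcomes = [(True, True), (True, False), (False, False), (True, True)]
    print(paired_accuracy(outcomes))
```

Scoring pairs jointly penalizes a model that always answers "yes" or always answers "no", since such a model gets at most one question of each pair right whenever the two ground-truth answers differ.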