Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz

TL;DR

This work tackles the core challenge of temporal understanding in Video-LLMs by introducing STAVEQ2, a model that injects stacked temporal attention (STA) blocks directly into the vision encoder to capture inter-frame dynamics across $T$ frames before feeding visual tokens to the LLM. The approach combines a parameter-efficient STA design with a two-stage training regimen and LoRA adapters, achieving state-of-the-art or competitive results on temporally demanding benchmarks such as SSv2 and on cross-dataset video understanding tasks. Key findings show that explicit temporal modeling at the encoder level substantially improves performance on fine-grained action recognition and visual-similarity tasks, and that STA yields consistent gains across multiple base architectures, including Qwen2-VL, InternVideo, VideoRoPE, and InternVideo2.5-Chat, which indicates broad applicability to video reasoning in multimodal systems. The work suggests that strengthening encoder-level temporal structure is crucial for generalization and could accelerate deployment of temporally aware Video-LLMs in real-world applications.

Abstract

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design adds temporal attention to the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
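
To make the idea of temporal attention inside a vision encoder concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a single attention head, identity query/key/value projections (no learned weights), a residual connection, and that each patch attends only to the same patch position in the other frames. The function names `softmax` and `temporal_attention` are hypothetical and chosen for illustration; the actual STA block in STAVEQ2 may differ in all of these respects.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(tokens):
    """Illustrative temporal self-attention (assumed design, see lead-in).

    tokens[t][n] is a D-dimensional feature list for patch n of frame t.
    Each patch attends over the same patch position in all T frames,
    and the attended features are added back through a residual connection.
    """
    num_frames = len(tokens)
    num_patches = len(tokens[0])
    dim = len(tokens[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = [[None] * num_patches for _ in range(num_frames)]
    for n in range(num_patches):
        # Sequence of length T: the same patch position across frames.
        seq = [tokens[t][n] for t in range(num_frames)]
        for t in range(num_frames):
            # Scaled dot-product scores of frame t against every frame.
            scores = [
                scale * sum(q * k for q, k in zip(seq[t], seq[s]))
                for s in range(num_frames)
            ]
            weights = softmax(scores)
            mixed = [
                sum(weights[s] * seq[s][d] for s in range(num_frames))
                for d in range(dim)
            ]
            # Residual connection keeps the original per-frame feature.
            out[t][n] = [x + m for x, m in zip(seq[t], mixed)]
    return out


if __name__ == "__main__":
    # Toy input: T=3 frames, N=2 patches, D=2 feature dimensions.
    video = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [0.0, 1.0]],
        [[0.0, 1.0], [0.0, 1.0]],
    ]
    mixed_video = temporal_attention(video)
    print(len(mixed_video), len(mixed_video[0]), len(mixed_video[0][0]))
```

In this sketch the output keeps the input shape of T frames by N patches by D features, so a block of this kind could in principle be stacked alongside the existing spatial attention layers without changing the token layout that is passed to the LLM.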

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, and 12 tables.

Figures (4)

  • Figure 1: Responses from Qwen2-VL and our STAVEQ2. Left: For a temporally simple action (Biking), both models answer correctly. Right: For a temporally challenging action (pulling something from right to left), Qwen2-VL provides an incorrect answer, while our STAVEQ2 succeeds.
  • Figure 2: Confusion matrices of InternVideo2-Chat performing action recognition on SSv2-T10 showing results on the following classes: (1) Pulling [something] from left to right; (2) Pulling [something] from right to left; (3) Throwing [something] in the air and catching it; (4) Throwing [something] in the air and letting it fall; (5) [Something] falling like a rock.
  • Figure 3: Our proposed STAVEQ2 architecture. Video frames are processed through transformer blocks with spatial and stacked temporal attention modules, capturing intra-frame and inter-frame dynamics. The resulting visual tokens are fed into the LLM for answer generation.
  • Figure G.1: Attention maps for the action poking [something] so that it falls over.