Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz
TL;DR
This work tackles temporal understanding in Video-LLMs by introducing STAVEQ2, a model that injects stacked temporal attention (STA) blocks directly into the vision encoder to capture inter-frame dynamics across $T$ frames before visual tokens are passed to the LLM. The approach combines a parameter-efficient STA design with a two-stage training regimen and LoRA adapters, achieving state-of-the-art or competitive results on temporally demanding benchmarks such as SSv2 and on cross-dataset video understanding tasks. Key findings show that explicit temporal modeling at the encoder level substantially improves fine-grained action recognition and visual-similarity tasks, and that STA yields consistent gains across multiple base architectures, including Qwen2-VL, InternVideo, VideoRoPE, and InternVideo2.5-Chat, which indicates broad applicability. The work suggests that strengthening encoder-level temporal structure is important for generalization and could help bring temporally aware Video-LLMs into real-world applications.
Abstract
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. Applying temporal attention inside the vision encoder enables the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, particularly in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
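To make the idea of temporal attention inside a vision encoder concrete, the following is a minimal sketch of attention applied along the time axis, under simplifying assumptions. It is not the STAVEQ2 implementation: the function names `temporal_attention` and `stacked_temporal_attention` are hypothetical, the query, key, and value projections are replaced by the identity, and multi-head attention, normalization, and the placement of blocks within the encoder are omitted because this section does not specify them. Each patch position attends only to the same position in the other $T$ frames, and a residual connection keeps the original features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames):
    """Single-head attention along the time axis (illustrative only).

    frames[t][p] is the feature vector (list of floats) of patch p in frame t.
    Assumption: identity Q/K/V projections stand in for learned weights.
    """
    T = len(frames)
    P = len(frames[0])
    D = len(frames[0][0])
    scale = 1.0 / math.sqrt(D)
    out = [[None] * P for _ in range(T)]
    for p in range(P):
        # Collect the same patch position across all T frames.
        track = [frames[t][p] for t in range(T)]
        for t in range(T):
            # Scaled dot-product scores between frame t and every frame s.
            scores = [
                scale * sum(q * k for q, k in zip(track[t], track[s]))
                for s in range(T)
            ]
            weights = softmax(scores)
            # Weighted average of the patch features over time.
            mixed = [
                sum(weights[s] * track[s][d] for s in range(T))
                for d in range(D)
            ]
            # Residual connection: original features plus temporal context.
            out[t][p] = [x + m for x, m in zip(track[t], mixed)]
    return out


def stacked_temporal_attention(frames, num_blocks=2):
    """Apply several temporal attention blocks in sequence (hypothetical stack)."""
    for _ in range(num_blocks):
        frames = temporal_attention(frames)
    return frames
```

A call such as `stacked_temporal_attention(frames, num_blocks=2)` returns features with the same `[T][P][D]` layout as its input, so in this simplified setting the blocks could sit between existing spatial layers without changing the token shapes seen downstream.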
