Multimodal Long Video Modeling Based on Temporal Dynamic Context

Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue

TL;DR

This work addresses long-video understanding under limited context by introducing Temporal Dynamic Context (TDC), which represents a video using static frame features plus dynamic multimodal context within semantically coherent scenes. A Q-Former-based compressor aggregates visual, audio, and instruction information into a compact set of temporal context tokens, enabling efficient alignment with LLMs. A training-free Long Video Chain-of-Thought (LVCoT) strategy enables stepwise reasoning over extremely long videos through segment-wise summarization followed by a timeline-aware final answer. Across general and audio-visual benchmarks, the approach achieves strong performance, with notable gains on long-video tasks. The model is trained with a multi-stage regimen, and the code is publicly released.

Abstract

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method that utilizes the temporal relationship between frames, named Temporal Dynamic Context (TDC). First, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual and audio encoders. Second, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
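The scene segmentation step described in the abstract can be illustrated with a minimal sketch. It assumes each frame has already been encoded into a feature vector, and it cuts a new scene wherever the cosine similarity between adjacent frames drops below a threshold. The function names, the thresholding rule, and the value `0.8` are illustrative assumptions, not details taken from the paper or its code release, which may select boundaries differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_scenes(
    frame_features: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, chosen only for illustration
) -> List[Tuple[int, int]]:
    """Split frames into scenes, returned as (start, end) index pairs, end exclusive.

    A new scene starts at frame i when the similarity between frame i-1 and
    frame i falls below `threshold`.
    """
    if not frame_features:
        return []
    scenes = []
    start = 0
    for i in range(1, len(frame_features)):
        if cosine_similarity(frame_features[i - 1], frame_features[i]) < threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frame_features)))
    return scenes


# Toy features: two frames pointing along one axis, then two along another.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scenes = segment_scenes(toy_features)
print(scenes)
```

In this toy input the similarity drops sharply between the second and third frames, so the sketch yields two scenes of two frames each. Under the method described above, the first frame of each scene would act as the static frame and the remaining frames would be compressed into temporal context tokens.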

Paper Structure

This paper contains 21 sections, 4 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Comparison of Visual and Audio Encoding in Video Modeling. (a) Existing methods encode each modality separately and then concatenate them, leading to inconsistencies and difficulties in handling long videos. (b) We propose Temporal Dynamic Context (TDC) compression, which incorporates both static visual features and dynamic video context to represent videos more effectively. This approach enables better multimodal integration and efficient compression for long videos.
  • Figure 2: Architecture of Our Multimodal Video Encoder. We first extract features for each second of the video, including both visual and corresponding audio tokens. The first frame is selected as the static frame, and a Q-Former is used to perform Temporal Dynamic Context compression based on its relationship with subsequent frames, resulting in $K$ compressed tokens per frame. The final video representation consists of all static frame tokens and multimodal video context.
  • Figure 3: Qualitative Demonstrations of Our 7B Model. (a) Our model can uniformly comprehend both audio and visual information, demonstrating strong performance in audio-visual dialogue tasks. (b) In movie description tasks, it can generate detailed descriptions of both the plot and visual elements. For extremely long videos, our LVCoT processes them segment by segment. The generated segment information, along with the timeline, serves as part of the reasoning process, enriching the final output with more details.
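The segment-by-segment LVCoT flow mentioned in the Figure 3 caption can be sketched as a simple control loop. This is a rough outline under stated assumptions, not the authors' implementation: `answer_segment` and `answer_final` are hypothetical callables standing in for calls to the video LLM, and the timeline format is invented for illustration.

```python
from typing import Callable, List, Tuple

# (start_seconds, end_seconds, clip); `clip` is a placeholder for segment data.
Segment = Tuple[float, float, str]


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


def long_video_cot(
    segments: List[Segment],
    question: str,
    answer_segment: Callable[[str, str], str],  # hypothetical: (clip, question) -> note
    answer_final: Callable[[str, str], str],    # hypothetical: (timeline, question) -> answer
) -> str:
    """Collect one intermediate answer per segment, then answer from the timeline."""
    notes = []
    for start, end, clip in segments:
        note = answer_segment(clip, question)
        notes.append(f"[{format_timestamp(start)} - {format_timestamp(end)}] {note}")
    timeline = "\n".join(notes)
    return answer_final(timeline, question)


# Stub callables so the control flow can be run without a model.
demo_segments = [(0, 60, "clip-a"), (60, 3725, "clip-b")]
demo_result = long_video_cot(
    demo_segments,
    "What happens in the video?",
    answer_segment=lambda clip, q: f"saw {clip}",
    answer_final=lambda timeline, q: timeline,
)
print(demo_result)
```

Because the strategy is training-free, the same model can serve both roles. The stubs here only make the control flow visible: intermediate answers are tagged with their time span and passed on as context for the final answer.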