DATE: Dynamic Absolute Time Enhancement for Long Video Understanding

Chao Yuan, Yang Yang, Yehui Yang, Zach Cheng

TL;DR

DATE introduces Timestamp Injection Mechanism (TIM) and Temporal-Aware Similarity Sampling (TASS) to empower long-video understanding in multimodal systems without retraining. TIM embeds explicit time tokens to create a continuous temporal reference, while TASS casts sampling as a vision-language retrieval problem and uses caption-based CLIP alignment plus temporally regularized greedy selection to preserve semantic relevance and temporal coverage. Across hour-long benchmarks, DATE achieves state-of-the-art results for 7B models and competitive results for 72B models, markedly improving absolute time localization and key event detection. The work highlights the importance of explicit time modeling and semantically guided sampling for robust long-video reasoning and offers inference-time enhancements that do not modify model weights.

Abstract

Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE), which enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: first enriching each query into a descriptive caption that aligns better with the visual features, and then sampling key events with a similarity-driven, temporally regularized greedy strategy. Our method achieves remarkable improvements in absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Notably, our 7B model even exceeds many 72B models on some benchmarks.
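
To make the idea of a similarity-driven, temporally regularized greedy strategy more concrete, the following is a minimal sketch and not the paper's algorithm. It assumes per-frame caption similarity scores are already computed, and the function name `greedy_temporal_sampling`, the `min_gap` parameter, and the gap-halving fallback are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def greedy_temporal_sampling(
    scores: Sequence[float],
    timestamps: Sequence[float],
    budget: int,
    min_gap: float,
) -> List[int]:
    """Pick up to `budget` frame indices, favouring high similarity scores
    while keeping picks at least `min_gap` seconds apart.

    Hypothetical sketch: the gap-halving fallback is an assumption made
    for illustration, not the procedure described in the paper.
    """
    if len(scores) != len(timestamps):
        raise ValueError("scores and timestamps must have the same length")
    budget = min(budget, len(scores))
    # Candidates ordered from most to least similar to the caption.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected: List[int] = []
    gap = float(min_gap)
    while len(selected) < budget:
        for i in order:
            if len(selected) == budget:
                break
            if i in selected:
                continue
            # Temporal regularization: reject frames too close to earlier picks.
            if all(abs(timestamps[i] - timestamps[j]) >= gap for j in selected):
                selected.append(i)
        if gap == 0.0:
            break
        # Relax the spacing constraint if the budget is not yet filled.
        gap = gap / 2.0 if gap > 1e-3 else 0.0
    # Return the picks in temporal order, as a video model would consume them.
    return sorted(selected, key=lambda i: timestamps[i])


scores = [0.1, 0.9, 0.85, 0.2, 0.8, 0.3]
timestamps = [0, 10, 20, 30, 40, 50]
picked = greedy_temporal_sampling(scores, timestamps, budget=3, min_gap=15)
```

In this toy input the two highest-scoring frames are only 10 seconds apart, so the second one is skipped on the first pass and only admitted once the spacing constraint is relaxed. A pure top-k selection would instead cluster its picks around a single high-similarity region.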

Paper Structure

This paper contains 28 sections, 3 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: A real example comparing our proposed DATE with Qwen2.5-VL. DATE with 12 frames outperforms Qwen2.5-VL with 256 frames.
  • Figure 2: Overview of the proposed framework. For each user question, an LLM-based Caption Generator produces a CLIP-aligned image caption, and its similarity with the video frames is calculated. The Temporal-Aware Similarity Sampling (TASS) strategy then samples the frames (the actual sampled frames and their order for this demo can be found in Appendix B). Finally, the Timestamp Injection Mechanism (TIM) embeds timestamps aligned with each frame.
  • Figure 3: Multimodal RoPE (MRoPE) with our Timestamp Injection Mechanism (TIM), compared with Qwen2.5-VL's MRoPE. Qwen2.5-VL: 15 is added because there are 15 seconds between frames. TIM (ours): the temporal dimension $T$ is extended with time tokens. The spatial dimensions ($H, W$) remain aligned with the first frame, ensuring spatial consistency across the whole sequence.
  • Figure 4: A real demo comparing DATE-7B with Qwen2.5-VL-7B. The caption is generated with our method and used to calculate similarity scores with the frames. The red points are the frames sampled with TASS. More examples can be found in the Appendix.
  • Figure 5: Comparison of performance related to event-aware tasks in the three benchmarks: Video-MME, LongVideoBench, and LVBench.
  • ...and 10 more figures
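
The last step in the Figure 2 caption, embedding timestamps aligned with each frame, can be illustrated with a small sketch. This is only an assumption-laden illustration of interleaving textual time markers with frames: the `HH:MM:SS` format, the placement of the marker before its frame, and the names `format_timestamp` and `interleave_timestamps` are hypothetical, and a real model would operate on token embeddings rather than tagged tuples.

```python
from typing import Any, List, Sequence, Tuple


def format_timestamp(seconds: float) -> str:
    """Render an absolute time as HH:MM:SS (the format is an assumption)."""
    total = int(seconds)  # truncate fractional seconds
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def interleave_timestamps(
    frames: Sequence[Any], timestamps: Sequence[float]
) -> List[Tuple[str, Any]]:
    """Build a sequence in which each frame is preceded by its time marker.

    Hypothetical sketch: placing the marker before the frame is an
    illustrative choice, not a detail confirmed by the source.
    """
    if len(frames) != len(timestamps):
        raise ValueError("frames and timestamps must have the same length")
    sequence: List[Tuple[str, Any]] = []
    for frame, t in zip(frames, timestamps):
        sequence.append(("time", format_timestamp(t)))
        sequence.append(("frame", frame))
    return sequence


# Frames are stand-in strings here; in practice they would be visual embeddings.
demo = interleave_timestamps(["f0", "f1", "f2"], [0, 15, 3725])
```

Because every frame carries its own absolute time marker, the sampled frames do not need to be evenly spaced for the model to recover when each one occurred.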