ChronusOmni: Improving Time Awareness of Omni Large Language Models

Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru

TL;DR

ChronusOmni tackles audiovisual temporal grounding for omni large language models by introducing a temporally synchronized representation that interleaves explicit time tokens with video and audio tokens. The model is trained in two stages, temporal-aware supervised finetuning and reinforcement learning with GRPO, to achieve precise cross-modal timing and alignment, aided by the ChronusAV dataset designed for six cross-modal subtasks. Results show state-of-the-art performance on ChronusAV and strong gains on LongVALE and visual grounding benchmarks, while preserving general video and audio understanding. The work delivers a scalable, efficient framework for fine-grained audiovisual temporal reasoning with practical implications for long-form content analysis and multimodal AI systems.

Abstract

Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
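
The interleaving described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the function names, the `<0.0s>` timestamp format, the one-second time unit, and the video-before-audio order inside each unit are all assumptions made for illustration, and plain strings stand in for the visual and audio representations.

```python
from typing import List, Sequence


def format_timestamp(seconds: float) -> str:
    """Render a time unit as plain text (hypothetical format, not from the paper)."""
    return f"<{seconds:.1f}s>"


def interleave(
    video_units: Sequence[Sequence[str]],
    audio_units: Sequence[Sequence[str]],
    unit_seconds: float = 1.0,
) -> List[str]:
    """Build one sequence: timestamp, then video tokens, then audio tokens, per time unit.

    video_units[i] and audio_units[i] hold the tokens of the i-th time unit.
    The video-then-audio order within a unit is an assumption.
    """
    if len(video_units) != len(audio_units):
        raise ValueError("video and audio must cover the same number of time units")

    sequence: List[str] = []
    for i, (video, audio) in enumerate(zip(video_units, audio_units)):
        sequence.append(format_timestamp(i * unit_seconds))  # text-based time marker
        sequence.extend(video)                               # visual tokens for this unit
        sequence.extend(audio)                               # audio tokens for this unit
    return sequence


if __name__ == "__main__":
    video = [["v0a", "v0b"], ["v1a", "v1b"]]
    audio = [["a0"], ["a1"]]
    print(interleave(video, audio))
    # ['<0.0s>', 'v0a', 'v0b', 'a0', '<1.0s>', 'v1a', 'v1b', 'a1']
```

Because every time unit opens with its own timestamp, tokens from both modalities that share a unit sit next to the same time marker, which is the property the abstract credits with enabling unified temporal modeling across modalities.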

Paper Structure

This paper contains 34 sections, 5 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Illustration of the audiovisual temporal grounding task. Three primary elements Video, Audio and Time are connected through six directional basic temporal grounding subtasks: Video-to-Time, Time-to-Video, Audio-to-Time, Time-to-Audio, Video-to-Audio, and Audio-to-Video.
  • Figure 2: The architecture of ChronusOmni. Time, video, and audio are tokenized and interleaved at each time step. The token sequence, along with the text prompt, is input into an LLM, which is trained with supervised finetuning and further enhanced by reinforcement learning.
  • Figure 3: Statistics of ChronusAV.
  • Figure 4: Evaluation on general video and audio understanding benchmarks. The evaluation metric for Video-MME and MUSIC-AVQA is Accuracy, with higher values being better. The evaluation metric for Librispeech and Visspeech is Word Error Rate (WER), with lower values being better. The "Base" is Ola.
  • Figure 9: Qualitative results on V2T subtask. The sample is from ChronusAV benchmark.
  • ...and 5 more figures