ChronusOmni: Improving Time Awareness of Omni Large Language Models

Yijing Chen; Yihan Wu; Kaisi Guan; Yuchen Ren; Yuyue Wang; Ruihua Song; Liyun Ru

TL;DR

ChronusOmni tackles audiovisual temporal grounding for omni large language models by introducing a temporally synchronized representation that interleaves explicit time tokens with video and audio tokens. The model is trained in two stages (temporal-aware supervised finetuning and reinforcement learning with GRPO) to achieve precise cross-modal timing and alignment, aided by the ChronusAV dataset designed for six cross-modal subtasks. Results show state-of-the-art performance on ChronusAV and strong gains on LongVALE and visual grounding benchmarks, while preserving general video and audio understanding. The work delivers a scalable, efficient framework for fine-grained audiovisual temporal reasoning with practical implications for long-form content analysis and multimodal AI systems.

Abstract

Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), even though such cross-modal temporal relations are prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on audiovisual temporal grounding tasks. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. These results highlight the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
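The interleaving idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `<0.0s>` timestamp format, the one-second time unit, and the choice to place video tokens before audio tokens within each unit are all assumptions made for illustration, and string placeholders stand in for the actual visual and audio embeddings.

```python
from typing import List, Sequence


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as a plain-text marker (hypothetical format)."""
    return f"<{seconds:.1f}s>"


def interleave(
    video_tokens: Sequence[Sequence[str]],
    audio_tokens: Sequence[Sequence[str]],
    unit_seconds: float = 1.0,
) -> List[str]:
    """Build one sequence: timestamp, then video, then audio, per time unit.

    video_tokens[i] and audio_tokens[i] hold the placeholder tokens that
    cover the i-th time unit. The ordering inside a unit is an assumption.
    """
    if len(video_tokens) != len(audio_tokens):
        raise ValueError("video and audio must cover the same number of time units")

    sequence: List[str] = []
    for i, (v_unit, a_unit) in enumerate(zip(video_tokens, audio_tokens)):
        # The text timestamp marks the start of this time unit for both modalities.
        sequence.append(format_timestamp(i * unit_seconds))
        sequence.extend(v_unit)
        sequence.extend(a_unit)
    return sequence


if __name__ == "__main__":
    video = [["v0a", "v0b"], ["v1a", "v1b"]]
    audio = [["a0"], ["a1"]]
    print(interleave(video, audio, unit_seconds=2.0))
```

Because both modalities in a unit follow the same timestamp marker, a question about one modality at a given time can, in principle, be answered by attending to the other modality's tokens that share that marker.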
