Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
Yolo Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu
TL;DR
This work tackles the scarcity of untrimmed audio-visual data with precise temporal annotations by synthesizing PU-VALOR, a large corpus of pseudo-untrimmed videos with temporal boundaries, derived from VALOR-32K. It then introduces AVicuna, an audio-visual LLM equipped with an Audio-Visual Token Interleaver and Time-Event Alignment for fine-grained temporal understanding and time-aware dialogue. Trained with a four-stage multimodal fine-tuning pipeline that also uses the A5-222K audio-text dataset, AVicuna achieves state-of-the-art results on open-ended video QA, audio-visual QA, and audio-visual event dense localization (AVEDL). The experiments also indicate that dataset design and interleaving strategy matter for temporal alignment, and they suggest that the approach is practical for temporally precise multimodal reasoning over long-form video.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. Fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, gives them temporal understanding capacity in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
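To make the idea of a pseudo-untrimmed video concrete, here is a minimal sketch of how random temporal scaling and permutation could turn a group of short captioned clips into one long timeline with exact event boundaries. It is an illustration under stated assumptions, not the authors' pipeline: the clips are assumed to have already been grouped by event-based clustering, and the names `Clip`, `Segment`, `make_pseudo_untrimmed`, and the `scale_range` default are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    duration: float  # original clip length in seconds
    caption: str


@dataclass
class Segment:
    clip_id: str
    start: float  # start time inside the pseudo-untrimmed video, in seconds
    end: float    # end time inside the pseudo-untrimmed video, in seconds
    caption: str


def make_pseudo_untrimmed(clips, rng, scale_range=(0.5, 2.0)):
    """Build boundary annotations for one pseudo-untrimmed video.

    Assumption: `clips` is one cluster of event-related clips; the
    clustering step itself is not shown. `scale_range` is an illustrative
    choice, not a value taken from the paper.
    """
    # Permutation: shuffle a copy so the caller's list is left untouched.
    order = list(clips)
    rng.shuffle(order)

    segments = []
    cursor = 0.0
    for clip in order:
        # Random temporal scaling: stretch or compress each clip.
        scale = rng.uniform(*scale_range)
        length = clip.duration * scale
        segments.append(Segment(clip.clip_id, cursor, cursor + length, clip.caption))
        # The next clip starts exactly where this one ends.
        cursor += length
    return segments


if __name__ == "__main__":
    cluster = [
        Clip("a", 10.0, "a dog barks at the door"),
        Clip("b", 8.0, "a person rings the doorbell"),
        Clip("c", 12.0, "the door opens and people greet each other"),
    ]
    for seg in make_pseudo_untrimmed(cluster, random.Random(0)):
        print(f"{seg.start:7.2f}s - {seg.end:7.2f}s  {seg.caption}")
```

Because each segment's boundaries are known by construction, the captions come with exact start and end times at no extra annotation cost, which is the property the abstract relies on.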
