Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Yolo Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu

TL;DR

This work addresses the scarcity of untrimmed audio-visual data with precise temporal annotations by synthesizing PU-VALOR, a large corpus of pseudo-untrimmed videos with precise temporal boundaries derived from VALOR-32K. It then introduces AVicuna, an audio-visual LLM equipped with an Audio-Visual Tokens Interleaver and Time-Event Alignment for fine-grained temporal understanding and time-aware dialogue. Trained with a four-stage multimodal fine-tuning pipeline that also uses the A5-222K audio-text dataset, AVicuna achieves state-of-the-art results on audio-visual event dense localization (AVEDL), open-ended video QA, and audio-visual QA. The experiments also show that dataset design and interleaving strategy matter for temporal alignment, which offers guidance for audio-visual integration in long-form video understanding.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. Fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, gives them temporal understanding capacity in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR through a method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we develop AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
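
The random temporal scaling and permutation step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Clip` and `Segment` types, the `scale_range` default, and the percentage-based caption template are all assumptions made for illustration, and the clips are assumed to come from a single event cluster that has already been computed.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    caption: str
    duration: float  # seconds


@dataclass
class Segment:
    caption: str
    start: float  # seconds from the start of the stitched video
    end: float


def make_pseudo_untrimmed(clips, rng, scale_range=(0.5, 2.0)):
    """Scale each clip's duration by a random factor, shuffle the order,
    and lay the clips end to end, recording each clip's boundaries.

    scale_range is a hypothetical choice, not a value from the paper.
    """
    scaled = [Clip(c.caption, c.duration * rng.uniform(*scale_range)) for c in clips]
    rng.shuffle(scaled)  # random permutation of the clip order
    segments, cursor = [], 0.0
    for c in scaled:
        segments.append(Segment(c.caption, cursor, cursor + c.duration))
        cursor += c.duration
    return segments, cursor  # cursor is now the total duration


def annotate(segments, total):
    """Fill an assumed caption template with boundaries given as a
    percentage of the total video length."""
    return [
        f"From {100 * s.start / total:.0f}% to {100 * s.end / total:.0f}%, {s.caption}"
        for s in segments
    ]
```

Because the boundaries are recorded while the clips are stitched together, the temporal annotations are exact by construction, which is the property the pseudo-untrimmed videos are meant to supply.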

Paper Structure (46 sections, 6 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Left: AVicuna's four-stage fine-tuning aligns natural language with exact time segments in audio-visual videos, highlighting its adeptness in dynamic content analysis. Right: AVicuna's superior performance across various video and audio-visual understanding tasks compared to other models.
  • Figure 2: Pipeline for creating the PU-VALOR dataset, which involves extracting text embeddings from high-quality audio-visual captions of the original trimmed VALOR-32K dataset, clustering these embeddings, and then applying Random Temporal Scaling & Permutation to generate pseudo-untrimmed videos. These synthesized videos are then annotated with temporal boundaries using a template-based approach to facilitate the subsequent audio-visual time-event alignment.
  • Figure 3: AVicuna model architecture and fine-tuning process. The Vision and Audio Adapters are MLPs that align each modality with the LLM. The Audio-Visual Tokens Interleaver ensures temporal synchronization. LoRA fine-tuning aligns temporal boundaries with events and enhances instruction-following capabilities.
  • Figure 4: AVicuna's performance on UnAV-100, measured by mAP scores at different AIRs.
  • Figure 5: Qualitative results. Blue indicates ground truth, green indicates the time intervals the user gives, and orange represents the model predictions. AVicuna supports audio-visual video input with various durations and resolutions. Given a user query about an event, it predicts the temporal interval accurately. Given a temporal interval, it provides an accurate response. It also performs reasoning given a question about audio-visual context.
  • ...and 3 more figures
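
To make the idea of temporally synchronized interleaving from Figure 3 concrete, here is a small sketch. The captions above do not specify how the Audio-Visual Tokens Interleaver orders tokens, so the scheme below (assigning each token a normalized timestamp and merging by it) and the function name `interleave_by_time` are assumptions, not the paper's method.

```python
def interleave_by_time(visual, audio):
    """Merge two temporally ordered token lists into one sequence.

    Token i of n gets the normalized timestamp (i + 0.5) / n, and the
    merged list is sorted by that timestamp. Ties are broken in favor
    of visual tokens. This ordering rule is a hypothetical choice.
    """
    tagged = [((i + 0.5) / len(visual), 0, tok) for i, tok in enumerate(visual)]
    tagged += [((i + 0.5) / len(audio), 1, tok) for i, tok in enumerate(audio)]
    tagged.sort(key=lambda t: (t[0], t[1]))
    return [tok for _, _, tok in tagged]


# Four visual tokens and two audio tokens covering the same time span.
merged = interleave_by_time(["v0", "v1", "v2", "v3"], ["a0", "a1"])
print(merged)  # ['v0', 'a0', 'v1', 'v2', 'a1', 'v3']
```

Each modality keeps its own internal order, while tokens that cover the same part of the video end up next to each other in the sequence passed to the LLM.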