LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng

TL;DR

LongVALE introduces the first omni-modal benchmark for long videos, aligning vision, audio, and speech with precise temporal boundaries and correlation-aware captions to support fine-grained, time-aware understanding. It presents a scalable, automatic data-generation pipeline covering video filtering, omni-modal boundary detection, and cross-modal captioning, yielding 8.4K long videos with 105K events and rich cross-modal narratives. Building on this dataset, LongVALE-LLM couples multi-modal encoders with a LoRA-enhanced LLM in a boundary-perception and instruction-tuning regime, achieving state-of-the-art or competitive results on three omni-modal tasks and zero-shot AVQA with relatively modest training data. The work demonstrates the practical potential of time-aware omni-modal perception for long videos and provides a foundation for future improvements in multi-modal video understanding and generalizable video-language models.

Abstract

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.

Paper Structure

This paper contains 31 sections, 2 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: We introduce LongVALE, the first-ever omni-modality long video benchmark, offering precise temporal boundaries and captions for omni-modal events integrating visual, audio, and speech information. The captions feature audio-visual correlations to enhance cross-modal learning. Besides, we extend three fine-grained video tasks to the omni-modality domain, enabling omni-perception of long videos.
  • Figure 2: The pipeline for high-quality omni-modality fine-grained data generation. It starts by detecting visual and audio event boundaries based on their distinct properties. Next, we generate detailed captions for each video and audio event enhanced by keyframe and speech captions. We then determine omni-modal event boundaries by maintaining the semantic integrity of single-modal events. Finally, omni-modal event captions are generated by audio-visual correlation reasoning, followed by manual refinement to ensure high data quality.
  • Figure 3: Statistics of LongVALE benchmark. (a) Video duration distribution of both training and test sets. (b) Distribution of the number of omni-modal events in videos for both training and test sets. (c) Distribution of omni-modal event duration. (d) Distribution of audio-visual correlation types. The examples of omni-modal events with different audio-visual correlations are also illustrated.
  • Figure 4: LongVALE-LLM architecture with boundary perception and instruction tuning stages using our LongVALE dataset.
  • Figure 5: Qualitative results. The orange text highlights audio-visual correlation for accurate and complete video understanding. Samples are from LongVALE and Music-AVQA test sets.
  • ...and 10 more figures
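
The Figure 2 caption says omni-modal event boundaries are chosen so that single-modal events keep their semantic integrity. The sketch below illustrates one simple way to read that idea. It assumes each visual or audio event is a `(start, end)` interval in seconds and that overlapping events are merged so that no event is cut in half. The function name `merge_into_omni_events` and the overlap rule are illustrative assumptions, not the authors' published procedure.

```python
from typing import List, Tuple

# Hypothetical representation: an event is (start_sec, end_sec).
Interval = Tuple[float, float]


def merge_into_omni_events(
    visual: List[Interval], audio: List[Interval]
) -> List[Interval]:
    """Union overlapping visual and audio events into omni-modal events.

    Assumption: an omni-modal boundary must never fall strictly inside a
    single-modal event, so any two events that overlap end up in one span.
    Events that only touch at an endpoint stay separate.
    """
    merged: List[Interval] = []
    for start, end in sorted(visual + audio):
        if merged and start < merged[-1][1]:
            # Overlaps the current omni-modal event: extend its end time.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: start a new omni-modal event.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    visual_events = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
    audio_events = [(8.0, 12.0), (22.0, 25.0)]
    print(merge_into_omni_events(visual_events, audio_events))
```

In this toy input, the audio event from 8 s to 12 s straddles the first two visual events, so they are joined into one omni-modal event, while the third visual event absorbs the audio event it fully contains.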