LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng
TL;DR
LongVALE introduces the first omni-modal benchmark for long videos, aligning vision, audio, and speech with precise temporal boundaries and correlation-aware captions to support fine-grained, time-aware understanding. It presents a scalable, automatic data-generation pipeline covering video filtering, omni-modal boundary detection, and cross-modal captioning, yielding 8.4K long videos with 105K events and rich cross-modal narratives. Building on this dataset, LongVALE-LLM couples multi-modal encoders with a LoRA-enhanced LLM in a boundary-perception and instruction-tuning regime, achieving state-of-the-art or competitive results on three omni-modal tasks and zero-shot AVQA with relatively modest training data. The work demonstrates the practical potential of time-aware omni-modal perception for long videos and provides a foundation for future improvements in multi-modal video understanding and generalizable video-language models.
Abstract
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
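To make the annotation unit concrete, an omni-modal event with temporal boundaries and a relation-aware caption might be represented roughly as follows. This is a minimal sketch only: the class names, field names, and example captions are hypothetical illustrations, not the dataset's actual schema or contents. The final line simply divides the two totals reported above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OmniModalEvent:
    """One annotated event (hypothetical schema, not the dataset's real format)."""
    start: float   # event start time in seconds
    end: float     # event end time in seconds
    caption: str   # caption relating vision, audio, and speech

    def __post_init__(self) -> None:
        # Reject boundaries that are negative or not strictly increasing.
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class LongVideo:
    """A long video with its list of events (hypothetical container)."""
    video_id: str
    events: List[OmniModalEvent] = field(default_factory=list)

    def events_at(self, t: float) -> List[OmniModalEvent]:
        """Return every event whose [start, end) interval contains time t."""
        return [e for e in self.events if e.start <= t < e.end]


# Invented example data, for illustration only.
video = LongVideo(
    video_id="example",
    events=[
        OmniModalEvent(0.0, 12.5, "A man speaks to the camera while music plays."),
        OmniModalEvent(12.5, 30.0, "He starts the engine; its roar drowns out the music."),
    ],
)

# Average events per video implied by the reported totals (105K events, 8.4K videos).
avg_events_per_video = 105_000 / 8_400
```

With the reported totals, the average works out to 12.5 events per video, which gives a sense of how densely each long video is annotated.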
