Unleashing Hour-Scale Video Training for Long Video-Language Understanding
Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum
TL;DR
This work addresses the scarcity of training data for hour-scale video-language understanding, and the inefficiency of training on such data, with two contributions. VideoMarathon is a large-scale synthetic long-video instruction-following dataset (about 9,700 hours of video and 3.3M QA pairs across 22 tasks). Hour-LLaVA is an efficient Video-LMM that processes hour-long videos at 1 FPS using a memory augmentation (MemAug) mechanism. VideoMarathon provides hierarchical captions and QA pairs that target long-term dependencies, while Hour-LLaVA retains the full video context in a memory repository and retrieves from it adaptively. Across four standard long-video benchmarks, Hour-LLaVA achieves state-of-the-art performance among open-source models, and ablations indicate that MemAug and 1D RoPE are critical both for maintaining long-range coherence and for robustness to different token-compression strategies. The results suggest that combining dense long-form training data with memory-augmented modeling improves long-form video-language understanding and offers a scalable path for future research.
Abstract
Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, with individual videos ranging from 3 to 60 minutes. It contains 3.3M high-quality QA pairs spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
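To make the memory augmentation idea more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that the "cached full video context" can be treated as a list of per-frame feature vectors, that compression is plain uniform subsampling, and that retrieval is a single softmax cross-attention step with a residual add. The names `subsample` and `augment`, the 4-dimensional features, and the 60-frame toy video are all hypothetical and chosen only for illustration; question conditioning is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def subsample(tokens, keep_every):
    """Assumed compression step: keep one token out of every `keep_every`."""
    return tokens[::keep_every]


def augment(queries, memory):
    """Each kept token attends over the full memory and adds what it retrieves.

    This is generic scaled dot-product cross-attention with a residual
    connection, used here as a stand-in for adaptive retrieval.
    """
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, m)) for m in memory]
        weights = softmax(scores)
        retrieved = [
            sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)
        ]
        out.append([qd + rd for qd, rd in zip(q, retrieved)])
    return out


# Toy "video": one 4-dim feature per frame, sampled at 1 FPS for 60 seconds.
memory = [[math.sin(0.1 * t + d) for d in range(4)] for t in range(60)]
compressed = subsample(memory, keep_every=10)  # tokens passed on to the language model
augmented = augment(compressed, memory)        # same count, enriched from the full context
print(len(memory), len(compressed), len(augmented))
```

In this sketch the number of tokens passed downstream is set by the compression step alone, while the full-length memory is only read during retrieval. That is the intuition behind keeping the whole video cached while still shortening the sequence; the actual module in the paper may differ in every detail.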
