Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum

TL;DR

This work tackles the scarcity and inefficiency of training data for hour-scale video-language understanding with two contributions. The first is VideoMarathon, a large-scale synthetic long-video instruction-following dataset (≈9,700 hours, 3.3M QA pairs across 22 tasks). The second is Hour-LLaVA, an efficient Video-LMM that processes hour-long videos at 1 FPS using a memory augmentation (MemAug) mechanism. VideoMarathon provides hierarchical captions and QA pairs that target long-term dependencies, while Hour-LLaVA preserves full-context fidelity via a memory repository and adaptive retrieval. Across four standard long-video benchmarks, Hour-LLaVA achieves state-of-the-art performance among open-source models, and ablations show that MemAug and 1D RoPE are critical for maintaining long-range coherence and for robustness to different token-compression strategies. The results suggest that combining dense long-form data with memory-augmented modeling advances practical long-form video-language understanding and offers a scalable path for future research.

Abstract

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from the cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
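
To make the 1-FPS setting concrete, the sketch below counts frames and visual tokens for the shortest and longest videos in the stated 3 to 60 minute range. The per-frame token count (`TOKENS_PER_FRAME = 64`) and the stride-based reduction are illustrative assumptions, not values or methods taken from the paper; only the 1-FPS sampling rate and the duration range come from the abstract.

```python
import math

# Hypothetical value for illustration only; not a figure from the paper.
TOKENS_PER_FRAME = 64


def num_frames(duration_s: int, fps: int = 1) -> int:
    """Frames sampled from a video of duration_s seconds at the given rate."""
    return duration_s * fps


def full_tokens(duration_s: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Visual tokens in the full 1-FPS context (what a cache of the whole video would hold)."""
    return num_frames(duration_s) * tokens_per_frame


def reduced_tokens(duration_s: int, temporal_stride: int, spatial_stride: int,
                   tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Tokens left after keeping every k-th frame and every s-th token per frame.

    This uniform striding is an assumed reduction scheme used only for the arithmetic.
    """
    kept_frames = math.ceil(num_frames(duration_s) / temporal_stride)
    kept_per_frame = math.ceil(tokens_per_frame / spatial_stride)
    return kept_frames * kept_per_frame


for minutes in (3, 60):
    seconds = minutes * 60
    print(minutes, num_frames(seconds), full_tokens(seconds), reduced_tokens(seconds, 4, 4))
```

Under these assumptions, an hour of video yields 3,600 frames, so the full context grows linearly with duration while the reduced sequence shrinks by the product of the two strides. This gap is what motivates caching the full context separately from the tokens fed to the decoder.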

Paper Structure

This paper contains 21 sections, 13 figures, 11 tables.

Figures (13)

  • Figure 1: VideoMarathon: A diverse long video instruction-following dataset. (a) The dataset contains 22 diverse tasks, covering both short-form (yellow tag) and long-form (red tag) comprehension. (b) The dataset spans diverse video source domains. (c) The dataset features a wide range of question types for long-form video-language modeling. (d) The dataset consists of long videos ranging from three minutes to one hour. (e) The dataset includes complex video content reflected by the number of events per video.
  • Figure 2: Overview of the Hour-LLaVA Framework. Input video features $\mathbf{H}_\text{v}$ encoded from 1-FPS sampled frames are selectively decayed spatially and temporally through a forgetting mechanism, producing decayed video tokens $\tilde{\mathbf{H}}_\text{v}$ for efficient video modeling. Meanwhile, the full video features $\mathbf{H}_\text{v}$ are stored in a memory repository. Given the decayed tokens $\tilde{\mathbf{H}}_\text{v}$ and the user question tokens $\mathbf{H}_\text{q}$, the MemAug module enhances the decayed tokens with full video context and question-relevant details from the memory repository, obtaining memory-augmented video tokens $\hat{\mathbf{H}}_\text{v}$. These augmented tokens are then passed with the original user question tokens $\mathbf{H}_\text{q}$ into the LLM decoder to generate the final response $\mathbf{X}_\text{a}$.
  • Figure 3: Dataset ablation and methodology comparison. Results are reported on three benchmarks: (a) TempCompass, (b) LongVideoBench, and (c) LVBench, for the Hour-LLaVA-3B and LLaVA-Video-3B models. The x-axis represents different training data mixture configurations, with exact ratios indicated in the legend. V.M. and L.V. refer to the VideoMarathon and LLaVA-Video-178K datasets, respectively.
  • Figure 4: Impact of compression ratio for temporal forgetting (left), and memory repository scale (middle). Comparison of the number of visual tokens input to the LLM decoder (right).
  • Figure 5: The prompt for clip-level video captioning.
  • ...and 8 more figures
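
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the forgetting step is modeled as uniform striding, MemAug is modeled as a single-head dot-product attention over the memory with no learned projections, and the question conditioning (adding the mean question token to each query) is a hypothetical simplification. The function names `forget`, `flatten`, `softmax`, and `mem_aug` are invented for this example.

```python
import math


def forget(h_v, temporal_stride=2, spatial_stride=2):
    # Assumed forgetting rule: keep every k-th frame and every s-th token per frame.
    return [frame[::spatial_stride] for frame in h_v[::temporal_stride]]


def flatten(frames):
    # Turn [frame][token][dim] into a flat [token][dim] sequence.
    return [token for frame in frames for token in frame]


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mem_aug(h_v_decayed, h_q, memory):
    dim = len(memory[0])
    # Hypothetical question conditioning: shift each query by the mean question token.
    q_mean = [sum(tok[d] for tok in h_q) / len(h_q) for d in range(dim)]
    augmented = []
    for token in h_v_decayed:
        query = [t + q for t, q in zip(token, q_mean)]
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in memory]
        weights = softmax(scores)
        retrieved = [sum(w * value[d] for w, value in zip(weights, memory))
                     for d in range(dim)]
        # Residual connection: decayed token plus retrieved memory context.
        augmented.append([t + r for t, r in zip(token, retrieved)])
    return augmented


# Toy input: 4 frames (1 FPS for 4 seconds), 4 tokens per frame, 3 feature dims.
h_v = [[[float(t), float(p), 1.0] for p in range(4)] for t in range(4)]
h_q = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]

memory = flatten(h_v)                # memory repository keeps all 16 tokens
h_v_decayed = flatten(forget(h_v))   # 2 frames x 2 tokens = 4 decayed tokens
h_v_hat = mem_aug(h_v_decayed, h_q, memory)
llm_input = h_v_hat + h_q            # augmented video tokens followed by question tokens
```

The point of the sketch is the shape of the pipeline: the decoder input stays at the decayed length (4 video tokens plus 2 question tokens here), while each of those video tokens can still draw on all 16 tokens held in the memory.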