ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan

TL;DR

ARC-Chapter addresses the challenge of structuring hour-long videos by proposing a scalable, multimodal framework trained on a million-scale long-video chapter dataset. It combines a frozen vision encoder with a trainable LLM, guided by diverse prompts to produce timestamped short titles, structural chapters, and timestamp-aligned descriptions, while using adaptive modality dropping to handle varying inputs. A novel GRACE metric captures many-to-one semantic alignment and granularity, and GRPO reinforcement learning further refines temporal boundaries without sacrificing description quality. Empirical results establish state-of-the-art performance on VidChapters-7M and VidAtlas, demonstrate strong transfer to YouCook2 and ActivityNet Captions, and reveal a scaling law in video chaptering that benefits from data volume and label density.

Abstract

The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training on annotations that are typically short and coarse, which restricts generalization to the nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on a million-scale set of long-video chapters with bilingual, temporally grounded, and hierarchical annotations. To this end, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, in both data volume and label intensity. Moreover, we design a new evaluation metric, GRACE, which incorporates many-to-one segment overlaps and semantic similarity and thus better reflects real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. ARC-Chapter also shows excellent transferability, improving the state of the art on downstream tasks such as dense video captioning on YouCook2.
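
To make the "many-to-one segment overlaps" idea concrete, the sketch below contrasts a one-to-one assignment with a many-to-one assignment using temporal IoU alone. It is an illustration under stated assumptions, not the GRACE definition: this page gives no formula for GRACE, the semantic-similarity term is omitted entirely, the greedy one-to-one matcher is only a simple stand-in for one-to-one matching in general, and the function names `iou`, `one_to_one`, and `many_to_one` are hypothetical.

```python
def iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def one_to_one(preds, gts):
    """Greedy one-to-one matching: every prediction and every ground-truth
    segment is used at most once (an illustrative stand-in only)."""
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_p, used_g, matches = set(), set(), []
    for score, i, j in pairs:
        if score > 0 and i not in used_p and j not in used_g:
            matches.append((i, j, score))
            used_p.add(i)
            used_g.add(j)
    return matches


def many_to_one(preds, gts):
    """Assign each prediction to its best-overlapping ground-truth segment;
    several predictions may share the same ground-truth segment."""
    if not gts:
        return []
    matches = []
    for i, p in enumerate(preds):
        j = max(range(len(gts)), key=lambda k: iou(p, gts[k]))
        score = iou(p, gts[j])
        if score > 0:
            matches.append((i, j, score))
    return matches


# Hypothetical example: the prediction splits the first ground-truth chapter in two.
gts = [(0, 60), (60, 120)]
preds = [(0, 30), (30, 60), (60, 120)]
strict = one_to_one(preds, gts)   # one of the two half-chapters is left unmatched
relaxed = many_to_one(preds, gts) # both half-chapters map to the first chapter
```

In this toy case the one-to-one matcher ignores one of the two finer-grained predictions even though it lies entirely inside a ground-truth chapter, which is the kind of granularity mismatch a many-to-one scheme is meant to tolerate.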

Paper Structure

This paper contains 35 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: An illustration of the capabilities of our video chaptering model. Given a video, our model is able to generate timestamped chapters with three-level structured output: 1) Short Title - a concise label summarizing each chapter; 2) Structural Chapter - a detailed, structured annotation for each chapter, including a rewritten comprehensive title, an abstract summarizing the core content, and an introduction describing key details and highlights; and 3) Timestamp-Aligned Video Description - fine-grained descriptions aligned with precise temporal boundaries. This hierarchical structure facilitates an efficient and precise understanding of video content.
  • Figure 2: Overview of our automatic video annotation pipeline for hierarchical chaptering and summarization. We extract visual captions (OCR included) from sampled video frames and ASR transcripts from audio. These outputs are temporally aligned and interleaved into a unified multimodal transcript. This transcript, together with original chapter markers, is processed by an LLM to produce structured chapters and timestamp-aligned video descriptions.
  • Figure 3: Dataset statistics: (a) Distribution of video durations (top) and chapter durations (bottom) in the VidAtlas dataset. (b) Distribution of video topics in VidAtlas.
  • Figure 4: Overview of the model architecture for video chaptering. The model inputs include a task-specific prompt, sampled video frames, and timestamped ASR transcripts. Video frames are processed with a frozen vision encoder. The resulting visual features, along with the tokenized prompt and ASR text, are fed into a trainable multimodal large language model (MLLM). Based on these inputs, the model can generate chapters in various formats, including timestamped concise titles, detailed structural chapters, or comprehensive video descriptions with timestamps.
  • Figure 5: Comparison of one-to-one (SODA) and many-to-one (GRACE) matching strategies. The one-to-one matching can fail to account for important events like $p_2$ and $g_2$, whereas the many-to-one strategy considers all predicted and ground-truth events for a more robust, overall assessment.
  • ...and 3 more figures
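
The interleaving step described in the Figure 2 caption (visual captions and ASR transcripts aligned in time and merged into one multimodal transcript) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Segment` type, the `ASR`/`CAPTION` source labels, the `[HH:MM:SS-HH:MM:SS]` line format, and the ordering by start time are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    source: str   # "ASR" or "CAPTION" (hypothetical labels)
    text: str


def fmt(t: float) -> str:
    """Format a time in seconds as HH:MM:SS."""
    s = int(t)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def interleave(asr, captions):
    """Merge two timestamped streams into one transcript ordered by start time."""
    merged = sorted([*asr, *captions], key=lambda seg: (seg.start, seg.source))
    return "\n".join(
        f"[{fmt(seg.start)}-{fmt(seg.end)}] {seg.source}: {seg.text}"
        for seg in merged
    )


# Hypothetical inputs: two speech segments and one frame-level caption.
asr = [
    Segment(0, 5, "ASR", "Welcome to the lecture."),
    Segment(12, 18, "ASR", "Let us start with the first topic."),
]
captions = [Segment(4, 4, "CAPTION", "A title slide is shown.")]
transcript = interleave(asr, captions)
```

A transcript in this form could then be passed, together with the original chapter markers, to an LLM prompt; how the prompt is built is not specified on this page.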