Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

TL;DR

Chapter-Llama introduces an efficient, LLM-based approach to chaptering hour-long videos by converting video content into text through ASR transcripts and frame captions. It uses a speech-guided frame selection strategy to sample a small number of frames, captions those frames as text, and finetunes a Llama-3.1-8B-Instruct model with LoRA to predict chapter boundaries and titles in a single forward pass, falling back to an iterative prediction scheme for very long videos. The method achieves substantial improvements on VidChapters-7M (e.g., an F1 score of 45.3 versus 26.7), and ablations on data size and modality choices show the benefit of combining ASR with captions, of frame selection, and of LLM finetuning. The results highlight the practicality of text-based chaptering for long videos and open avenues for hierarchical chaptering and broader audio modalities. The approach depends on ASR and captioning quality and may inherit biases from large web-trained models.
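The summary above does not spell out how the iterative prediction scheme works. One plausible reading, sketched below as an assumption rather than the authors' implementation, is to split the timestamped text into consecutive windows that each fit a token budget, predict chapters per window, and concatenate the results. The names `split_into_windows`, `chapter_iteratively`, and `predict` are hypothetical, and the whitespace word count is only a stand-in for a real tokenizer.

```python
def split_into_windows(lines, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack consecutive lines into windows of at most max_tokens.

    count_tokens defaults to a whitespace word count, which is only a rough
    proxy for a real tokenizer (assumption for illustration).
    A single line longer than max_tokens still gets its own window.
    """
    windows, current, used = [], [], 0
    for line in lines:
        n = count_tokens(line)
        # Start a new window when adding this line would exceed the budget.
        if current and used + n > max_tokens:
            windows.append(current)
            current, used = [], 0
        current.append(line)
        used += n
    if current:
        windows.append(current)
    return windows


def chapter_iteratively(lines, max_tokens, predict):
    """Run a hypothetical predict(window) on each window and merge the outputs."""
    chapters = []
    for window in split_into_windows(lines, max_tokens):
        chapters.extend(predict(window))
    return chapters
```

Here `predict` stands for whatever maps one window of text to a list of chapters, for example a call to the finetuned LLM followed by output parsing.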

Abstract

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models at our project page.
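To make the text-domain formulation concrete, the sketch below merges timestamped speech transcripts and frame captions into one chronologically ordered input, and parses timestamped chapter lines from a model's output. The `HH:MM:SS` line format, the `ASR`/`Caption` labels, and all function names are assumptions for illustration; the abstract does not specify the exact format used by Chapter-Llama.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    source: str   # "ASR" or "Caption" (labels are an assumption)
    text: str


def fmt_time(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_input(asr, captions):
    """Merge ASR and caption segments in chronological order, one per line."""
    merged = sorted(asr + captions, key=lambda seg: seg.start)
    return "\n".join(
        f"{fmt_time(seg.start)} {seg.source}: {seg.text}" for seg in merged
    )


# Hypothetical output format: one "HH:MM:SS title" line per chapter.
CHAPTER_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(.+)$")


def parse_chapters(output: str):
    """Parse chapter lines into (start_seconds, title) pairs, skipping others."""
    chapters = []
    for line in output.splitlines():
        m = CHAPTER_LINE.match(line.strip())
        if m:
            h, mi, s, title = m.groups()
            chapters.append((int(h) * 3600 + int(mi) * 60 + int(s), title.strip()))
    return chapters
```

Lines that do not match the expected pattern are dropped rather than raising an error, since free-form generation can include stray text.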

Paper Structure

This paper contains 54 sections, 11 figures, and 19 tables.

Figures (11)

  • Figure 1: Chapter-Llama: Our method generates automatic video chapters for hour-long videos by training a large language model (LLM) to predict chapter boundaries and titles. The LLM processes transcribed speech (ASR) and descriptive captions of key frames, which are sampled based on ASR content. This text-based approach, equipped with speech-based frame selection, enables efficient processing of long-form content.
  • Figure 2: Method overview: Our Chapter-Llama framework first selects video frames to process using speech information. Then we use a visual captioner to map the selected frames into the text space. We feed the resulting captions, along with speech transcripts, to the LLM, which outputs the chapter boundaries and titles jointly as a single sequence of tokens.
  • Figure 3: Qualitative results: We display two examples and compare our Chapter-Llama results against the ground truth (GT), as well as the zero-shot (ZS) and Vid2Seq (VS) baselines. For each example, we show the corresponding SODA (S) and CIDEr (C) scores. Our method overall shows the highest similarity with the GT, while Vid2Seq can suffer from repeated chapter titles, and zero-shot generations tend to over-segment.
  • Figure 4: Amount of training data: Our experiments show a substantial improvement when moving from zero-shot to training with 1k videos. Beyond 1k videos, performance continues to improve but at a much slower rate, motivating our choice of using only 10k training videos for our final LLM.
  • Figure A.1: Video duration distribution: Distribution of video durations in our training set (bars, left axis) and average number of chapters per duration bin (gray line, right axis). Most videos are less than 15 minutes long, with progressively fewer videos at longer durations. The average number of chapters increases with video duration but plateaus around 13 chapters for videos longer than one hour.
  • ...and 6 more figures