FocusedAD: Character-centric Movie Audio Description

Xiaojun Ye, Chun Wang, Yiren Song, Sheng Zhou, Liangcheng Li, Jiajun Bu

TL;DR

FocusedAD tackles automatic movie audio description by generating character-centric narration with explicit name references and narrative relevance. It introduces a Character Perception Module to identify and track main characters, a Dynamic Prior Module that injects context from prior ADs and subtitles via learnable soft prompts, and a Focused Caption Module that fuses scene, character, and text tokens through an LLM to produce concise, story-aware descriptions. An automated pipeline builds a robust character query bank to address identity recognition across appearance changes. The approach achieves state-of-the-art performance, including strong zero-shot results on MAD-eval-Named and Cinepile-AD, demonstrating improved narrative coherence and accessibility for blind and visually impaired (BVI) users.

Abstract

Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at https://github.com/Thorin215/FocusedAD .

Paper Structure

This paper contains 19 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: FocusedAD: We propose an automated character-centric AD generation model that emphasizes main character regions' appearances and actions while incorporating narrative context. Characters appearing in the movie clip are annotated with colored bounding boxes.
  • Figure 2: Overview of FocusedAD: FocusedAD takes movie clips as input and builds the character best-query bank through clustering. The Character Perception Module identifies main characters in key frames and bi-directionally propagates character regions across the entire key frame sequence. Then, the Dynamic Prior Module dynamically integrates visual and text priors using soft prompts. Finally, the Focused Caption Module takes scene-level tokens, character-level tokens, and soft prompts as input to generate character-centric audio descriptions.
  • Figure 3: The Character Perception Module traverses the key frame sequence, detecting main characters in any frame and obtaining their segmented regions. Videos are processed in a streaming fashion, where each frame cross-attends to the main character memories from context frames. Finally, both the region predictions and the key frame embeddings are stored in the memory bank.
  • Figure 4: Instruction template with soft prompt. We use a well-designed instruction template with trainable soft prompts to inject the text prior and visual prior into the Focused Caption Module.
  • Figure 5: Samples of Storyboard-v2. Our dataset has three main parts: (i) movie clips, (ii) character regions, and (iii) ground-truth movie audio descriptions.
  • ...and 3 more figures
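
The Figure 2 caption mentions a character query bank obtained through clustering, which is then used to link detected characters to names. The summary above does not specify how that is done, so the following is only a minimal sketch under stated assumptions: each character already has a cluster of face or appearance embeddings, the "best query" is taken to be the cluster member closest to the cluster centroid, and matching uses cosine similarity with a fixed threshold. The function names (`build_query_bank`, `identify`), the threshold value, and the toy three-dimensional embeddings are all hypothetical and are not taken from the paper or its code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_query_bank(clusters):
    """Assumed selection rule: keep, per character, the embedding nearest its centroid."""
    bank = {}
    for name, embeddings in clusters.items():
        center = centroid(embeddings)
        bank[name] = max(embeddings, key=lambda v: cosine(v, center))
    return bank


def identify(embedding, bank, threshold=0.8):
    """Return the best-matching character name, or None if no score reaches the threshold."""
    best_name, best_score = None, threshold
    for name, query in bank.items():
        score = cosine(embedding, query)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy data: hypothetical embeddings grouped by (hypothetical) character name.
clusters = {
    "Alice": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.8, 0.0, 0.3]],
    "Bob": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
bank = build_query_bank(clusters)

print(identify([0.95, 0.05, 0.05], bank))  # close to the "Alice" query
print(identify([0.0, 0.0, 1.0], bank))     # far from every query
```

Keeping a single representative per character keeps lookup cheap, while the threshold lets unknown or background faces fall through as unnamed; a real system would need to handle appearance changes, which a single query per character may not cover.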