AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

TL;DR

AutoAD-Zero is a training-free framework that generates Audio Descriptions (ADs) for both movies and TV series using off-the-shelf VLMs and LLMs. Its performance is competitive with some models fine-tuned on ground-truth ADs, and it achieves state-of-the-art CRITIC scores.

Abstract

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters, without any fine-tuning, if it is directly prompted with character information through visual indications; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise the dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground-truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
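The two-stage process in contribution (ii) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_stage1_prompt`, `build_stage2_prompt`, `generate_ad`), the prompt wording, and the `vlm`/`llm` callables are all hypothetical stand-ins for whichever off-the-shelf models are used.

```python
from typing import Any, Callable, Dict, Sequence


def build_stage1_prompt(characters: Dict[str, str]) -> str:
    """Build the VLM text prompt (wording is an assumption, not the paper's prompt).

    `characters` maps a character name to the colour of the circle
    drawn around that character's face in the frames.
    """
    who = ""
    if characters:
        legend = "; ".join(
            f"{name} ({colour} circle)" for name, colour in characters.items()
        )
        who = f"Possible characters, labelled by coloured circles: {legend}. "
    return who + (
        "Describe the main characters, their actions, "
        "interactions and facial expressions."
    )


def build_stage2_prompt(description: str) -> str:
    """Build the LLM prompt that condenses the dense description (assumed wording)."""
    return (
        "Summarise the following video description into one succinct "
        "AD sentence.\n" + description
    )


def generate_ad(
    frames: Sequence[Any],
    characters: Dict[str, str],
    vlm: Callable[[Sequence[Any], str], str],
    llm: Callable[[str], str],
) -> str:
    """Stage 1: dense VLM description. Stage 2: LLM summary into one AD sentence."""
    dense_description = vlm(frames, build_stage1_prompt(characters))
    return llm(build_stage2_prompt(dense_description)).strip()
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or LLM API, which matches the training-free, off-the-shelf setting described above.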

Paper Structure

This paper contains 25 sections, 2 equations, 9 figures, 11 tables, 3 algorithms.

Figures (9)

  • Figure 1: A training-free framework for zero-shot AD generation. AutoAD-Zero features a two-stage process, where a VLM initially generates a comprehensive video description from multiple aspects, followed by an LLM-based AD summary in the second stage. To incorporate character information into this framework, character faces in the input video are matched with those in an external character bank and labelled with coloured circles. The corresponding character names and colour codes are then provided as text prompts to the VLM.
  • Figure 2: Character recognition and VLM prompting. An off-the-shelf face detection model is employed to obtain bounding boxes and face features in video frames. These "in-frame face features" are then matched with "portrait face features" extracted from character profile images, which determines the identities in the video. To prompt the VLM with character information, character faces are labelled by coloured circles, with corresponding names and colour codes provided in the text prompt.
  • Figure 3: Two-stage training-free AD generation. The first stage adopts a VLM to produce a comprehensive video description, covering aspects including main characters, actions, interactions, and facial expressions. The second stage uses an LLM to summarise the video into a single AD sentence, extracting the most relevant character and action information, and adjusting the content and style according to specific rules.
  • Figure 4: Comparison of AD duration between TV-AD and CMD-AD.
  • Figure 5: Dataset formulation. The aim is to convert AD annotations from AudioVault into text form and align them with TV episodes. The main pipeline consists of two steps: (i) The soundtrack pre-processing step, which aligns the AudioVault and TV timestamps via audio-audio matching and transcribes both sound sources into transcripts; (ii) The AD filtering step, which retrieves text-form ADs from the transcripts and performs further cleaning.
  • ...and 4 more figures
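
The character recognition step described in Figure 2 can be illustrated with a small sketch. It assumes that in-frame and portrait face features are plain vectors compared by cosine similarity with a fixed threshold; the similarity measure, the threshold value, the colour palette, and the names `match_face` and `label_faces` are assumptions made for illustration, not details confirmed by the figure captions.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_face(
    face_feature: Sequence[float],
    character_bank: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed value, chosen only for this sketch
) -> Optional[str]:
    """Return the best-matching character name, or None if no score exceeds the threshold."""
    best_name, best_score = None, threshold
    for name, portrait_feature in character_bank.items():
        score = cosine_similarity(face_feature, portrait_feature)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def label_faces(
    face_features: Sequence[Sequence[float]],
    character_bank: Dict[str, Sequence[float]],
    palette: Sequence[str] = ("red", "green", "blue", "yellow"),
) -> Dict[str, str]:
    """Assign one circle colour per recognised character, in order of appearance."""
    labels: Dict[str, str] = {}
    for feature in face_features:
        name = match_face(feature, character_bank)
        if name is not None and name not in labels and len(labels) < len(palette):
            labels[name] = palette[len(labels)]
    return labels
```

The resulting name-to-colour mapping is the kind of information that would be passed to the VLM as a text prompt alongside frames in which the faces are circled in the corresponding colours.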