AutoAD III: The Prequel -- Back to the Pixels

Tengda Han; Max Bain; Arsha Nagrani; Gül Varol; Weidi Xie; Andrew Zisserman

TL;DR

This work tackles automatic movie audio description generation from pixel data by introducing CMD-AD and HowTo-AD, two large-scale pixel-based datasets that enable end-to-end AD modeling. It presents two architectures, Movie-BLIP2 and Movie-Llama2, built on frozen visual encoders and LLMs via a Q-former, augmented with a character bank to produce character-aware descriptions. To quantify AD quality beyond traditional caption metrics, the authors introduce CRITIC and LLM-AD-eval, tailored to character naming and semantic adequacy. Experiments on CMD-AD-Eval and MAD-Eval demonstrate state-of-the-art performance, with HowTo-AD pretraining delivering substantial gains, underscoring the value of diverse, large-scale pixel data for AD tasks.

Abstract

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

Paper Structure (32 sections, 4 equations, 13 figures, 7 tables)

Introduction
Related Work
New Datasets for Pixels to AD
CMD-AD -- Pixels from Aligned CMD
HowTo-AD -- Pixels from HowTo100M
Model Architecture
Evaluation Methods
CRITIC (Co-Referencing In Text for Identifying Characters).
LLM-AD-eval.
Experiments
Datasets
Evaluation Measures
Inter-rater Evaluations
Quantitative Results
Architecture Comparisons on Aligned-CMD.
...and 17 more sections

Figures (13)

Figure 1: We propose two new movie Audio Description (AD) datasets with pixels, CMD-AD and HowTo-AD, by temporally aligning or textually transforming existing pixel video datasets. The marker size is proportional to the total video duration, and grey indicates datasets with features instead of raw pixels.
Figure 2: Audio-audio alignment between two sources. (Left): For each small audio segment on AudioVault, we find the best-matching audio segment in the CMD clip and plot the two timestamps as a scatter plot. (Right): Fitting a straight line with RANSAC gives the precise mapping function between the two sources. The slope of the fitted line, $0.959<1$, indicates that this CMD clip plays slightly faster than the corresponding AudioVault chunk.
Figure 3: HowTo-AD dataset. We convert the LLM-rewritten video descriptions (from HowToCaption) to fit movie audio descriptions by (i) uniformly replacing the subjects in descriptions with a randomly sampled name, e.g. John, and (ii) constructing a character bank by providing a frame with the instructor and the randomly sampled name. The video sample is from https://youtu.be/aRbQb19v2JI.
Figure 4: Architecture overview. Our model takes as input movie frames and a movie character bank from IMDb, including face exemplars and character names, and produces character-aware audio descriptions. The input images/videos are first fed to a frozen visual feature extractor to obtain spatial or spatio-temporal visual features. A shared Q-former then processes the visual information and projects it to the language embedding space, in order to leverage frozen large language models (LLMs) like OPT and Llama2 for text generation.
Figure 5: Illustration of the CRITIC metric. The paragraphs consisting of the character list and AD (a, c) are fed into a co-referencing model to obtain co-referenced identities (b, d). The CRITIC metric computes an IoU between the identities in the prediction and the identities in the reference.
...and 8 more figures
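
The line fit described in the Figure 2 caption can be illustrated with a small, dependency-free RANSAC sketch. This is not the authors' code: the function names, the iteration count, and the inlier tolerance are hypothetical, and it assumes each matched pair is given as (AudioVault time, CMD time) in seconds, so that a slope below 1 corresponds to the CMD clip playing faster.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (AudioVault time, CMD time); axis order is an assumption


def least_squares_line(points: List[Point]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_line_ransac(
    points: List[Point],
    n_iters: int = 200,       # hypothetical default
    inlier_tol: float = 0.5,  # hypothetical tolerance, in seconds
    seed: int = 0,
) -> Tuple[float, float]:
    """Fit a line that is robust to wrongly matched audio segments."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    best_inliers: List[Point] = []
    for _ in range(n_iters):
        # Propose a candidate line from two randomly chosen matches.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Keep the matches that agree with the candidate line.
        inliers = [
            (x, y) for x, y in points
            if abs(y - (slope * x + intercept)) <= inlier_tol
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus line found")
    # Refine with least squares on the largest consensus set.
    return least_squares_line(best_inliers)
```

The fitted slope and intercept then map any AudioVault timestamp to the corresponding CMD timestamp. A production pipeline would more likely rely on an established robust-regression routine than on a hand-rolled loop like this one.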
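
The final step of the CRITIC metric, as described in the Figure 5 caption, is an IoU between the character identities found in the predicted AD and those found in the reference AD. The sketch below covers only that set comparison and assumes the co-referencing model has already resolved each AD to a collection of character names. The function names, the case-insensitive matching, and the score of 1.0 when both sets are empty are assumptions, not details taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def _normalise(names: Iterable[str]) -> Set[str]:
    """Lower-case and strip names so 'Rose ' and 'rose' match (an assumption)."""
    return {name.strip().lower() for name in names if name.strip()}


def identity_iou(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Intersection-over-union of two sets of resolved character identities."""
    pred = _normalise(predicted)
    ref = _normalise(reference)
    union = pred | ref
    if not union:
        # Assumption: neither AD names a character, so they trivially agree.
        return 1.0
    return len(pred & ref) / len(union)


def mean_identity_iou(pairs: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Average the per-AD IoU over (predicted, reference) pairs."""
    if not pairs:
        raise ValueError("no AD pairs to score")
    return sum(identity_iou(p, r) for p, r in pairs) / len(pairs)
```

For example, a prediction that resolves to Jack and Rose scored against a reference that mentions only Rose shares one identity out of two in the union, giving an IoU of 0.5.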
