AutoAD III: The Prequel -- Back to the Pixels
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
TL;DR
This work tackles automatic movie audio description generation from pixel data by introducing CMD-AD and HowTo-AD, two large-scale pixel-based datasets that enable end-to-end AD modeling. It presents two architectures, Movie-BLIP2 and Movie-Llama2, built on frozen visual encoders and LLMs via a Q-former, augmented with a character bank to produce character-aware descriptions. To quantify AD quality beyond traditional caption metrics, the authors introduce CRITIC and LLM-AD-eval, tailored to character naming and semantic adequacy. Experiments on CMD-AD-Eval and MAD-Eval demonstrate state-of-the-art performance, with HowTo-AD pretraining delivering substantial gains, underscoring the value of diverse, large-scale pixel data for AD tasks.
Abstract
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.
