AutoAD III: The Prequel -- Back to the Pixels

Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

TL;DR

This work tackles automatic movie audio description generation from pixel data by introducing CMD-AD and HowTo-AD, two large-scale pixel-based datasets that enable end-to-end AD modeling. It presents two architectures, Movie-BLIP2 and Movie-Llama2, built on frozen visual encoders and LLMs via a Q-former, augmented with a character bank to produce character-aware descriptions. To quantify AD quality beyond traditional caption metrics, the authors introduce CRITIC and LLM-AD-eval, tailored to character naming and semantic adequacy. Experiments on CMD-AD-Eval and MAD-Eval demonstrate state-of-the-art performance, with HowTo-AD pretraining delivering substantial gains, underscoring the value of diverse, large-scale pixel data for AD tasks.

Abstract

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

Paper Structure (32 sections, 4 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: We propose two new movie Audio Description (AD) datasets with pixels, CMD-AD and HowTo-AD, built by temporally aligning or textually transforming existing pixel video datasets. The marker size is proportional to the total video duration, and grey indicates datasets that provide features instead of raw pixels.
  • Figure 2: Audio-audio alignment between two sources. (left): For each small audio segment on AudioVault, we find the best-matching audio segment in the CMD clip and plot the two timestamps as a scatter plot; (right): Fitting a straight line with RANSAC gives the precise mapping function between the two sources. The slope of the fitted line, $0.959<1$, indicates that this CMD clip plays slightly faster than the corresponding AudioVault chunk.
  • Figure 3: HowTo-AD dataset. We convert the LLM-rewritten video descriptions (from HowToCaption) to fit movie audio descriptions by (i) uniformly replacing the subjects in the descriptions with a randomly sampled name, e.g. John, and (ii) constructing a character bank by providing a frame with the instructor and the randomly sampled name. The video sample is from https://youtu.be/aRbQb19v2JI.
  • Figure 4: Architecture overview. Our model takes as input movie frames and a movie character bank from IMDb, including face exemplars and character names, and produces character-aware audio descriptions. The input images/videos are first fed to a frozen visual feature extractor to obtain spatial or spatial-temporal visual features. The model then uses a shared Q-former to process the visual information and project it into the language embedding space, so that frozen large language models (LLMs) such as OPT and Llama2 can be used for text generation.
  • Figure 5: Illustration of the CRITIC metric. The paragraphs consisting of the character list and the AD (a, c) are fed into a co-referencing model to obtain co-referenced identities (b, d). The CRITIC metric computes an IoU between the identities in the prediction and the identities in the reference.
  • ...and 8 more figures
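
The line fitting described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the inlier tolerance, the iteration count, and the synthetic timestamps are all assumptions made for illustration. It also assumes the x-axis holds AudioVault timestamps and the y-axis holds CMD timestamps (both in seconds), so that a slope below 1 corresponds to the CMD clip playing faster.

```python
import random


def least_squares_line(points):
    """Ordinary least-squares fit of y = slope * x + offset."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def fit_line_ransac(points, n_iters=500, inlier_tol=0.5, seed=0):
    """Fit a line to (x, y) pairs while ignoring outliers.

    n_iters and inlier_tol (seconds) are illustrative values, not
    settings taken from the paper.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iters):
        # Hypothesise a line through two randomly chosen matches.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate, cannot be written as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        offset = y1 - slope * x1
        # Count the matches that agree with this candidate line.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + offset)) <= inlier_tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("RANSAC found no consensus line")
    # Refine the estimate with least squares on the consensus set.
    return least_squares_line(best_inliers)


def make_synthetic_matches(seed=1):
    """Hypothetical data: 60 correct matches on a known line plus 20 wrong ones."""
    rng = random.Random(seed)
    matches = []
    for i in range(60):
        t_audiovault = i * 2.0
        t_cmd = 0.959 * t_audiovault + 12.0 + rng.uniform(-0.05, 0.05)
        matches.append((t_audiovault, t_cmd))
    for _ in range(20):  # spurious best-matches scattered at random
        matches.append((rng.uniform(0.0, 120.0), rng.uniform(0.0, 130.0)))
    return matches


if __name__ == "__main__":
    slope, offset = fit_line_ransac(make_synthetic_matches())
    print(f"slope={slope:.3f}, offset={offset:.2f}s")
```

Sampling two matches at a time and keeping the largest consensus set is what lets the fit ignore spurious best-matches, which would otherwise bias a plain least-squares line. The fitted slope and offset then give a mapping from a timestamp in one source to the corresponding timestamp in the other.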