DistinctAD: Distinctive Audio Description Generation in Contexts

Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan

TL;DR

Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.

Abstract

Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose DistinctAD, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. In Stage-I, to address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.
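The distinctive word idea in the abstract (keep only the words of the current AD that are not repeated in its context) can be illustrated with a small sketch. This is not the authors' implementation: the function name `distinctive_words`, the regex tokenizer, and the lowercase matching are all assumptions made for illustration, and the loss that would consume these target words is not shown because its formulation is not given in this summary.

```python
import re


def distinctive_words(current_ad, context_ads):
    """Return the words of current_ad that occur in none of the context ADs.

    Hypothetical helper: tokenization (lowercased alphabetic runs) is an
    assumption, not the tokenizer used in the paper.
    """
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    # Collect every word that appears anywhere in the neighboring ADs.
    seen_in_context = set()
    for ad in context_ads:
        seen_in_context.update(tokenize(ad))

    # Keep first occurrences of words unique to the current AD, in order.
    result = []
    for word in tokenize(current_ad):
        if word not in seen_in_context and word not in result:
            result.append(word)
    return result


context = ["He walks to the door.", "She looks at him."]
print(distinctive_words("He opens the door slowly.", context))
```

In this toy case the words shared with the context ("he", "the", "door") are filtered out, leaving only the terms specific to the current description.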

Paper Structure

This paper contains 16 sections, 13 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Illustration of Stage-I: CLIP-AD Adaptation. This process involves adapting the CLIP vision encoder to specific movie-AD data through global-level video-AD matching (bottom right) and fine-grained frame-AD matching (top right).
  • Figure 2: Pipeline of Stage-II: Distinctive AD Narration. Stage-II processes $N$ consecutive video clips using the CLIP$_{\mathrm{AD}}$ vision encoder from Stage-I. We generate contextual-distinctive ADs through two key innovations: i) a Contextual EMA module that learns compact and discriminative visual representations for improved prompting of LLMs; ii) an extra distinctive word loss for predicting AD-specific terms.
  • Figure 3: Ablation studies for hyperparameters in Stage-II, with final settings highlighted in orange. (a) Impact of $\alpha$ on the weight of compact representation $\widehat{\mathcal{H}}$. (b) Influence of $\beta$ on cross-attended feature $\widetilde{\mathcal{H}}$. (c) Impact of $K$, which denotes the number of clusters in bases $\mathcal{M}$. (d) Effect of sampling $N$ consecutive video clips. We switch to larger-memory GPUs when $N$ exceeds 16.
  • Figure 4: Qualitative results. We present ground-truth (GT) ADs, publicly released AutoAD-Zero outputs, and our DistinctAD predictions for several temporally consecutive movie clips. Movie frames are taken from The Ides of March (2011). Zoom in for details.
  • Figure 5: Visualizations of Contextual EMA. (a) A set of randomly generated 3D data $\mathcal{H}$, drawn from $N$ sample types. (b) Compact features $\widehat{\mathcal{H}}$ obtained via Data Re-estimation (DR). (c) Cross-attention outputs $\widetilde{\mathcal{H}}$ between $\mathcal{H}$ and bases $\mathcal{M}$.
  • ...and 4 more figures
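
The captions for Figures 3 and 5 refer to bases $\mathcal{M}$ with $K$ clusters and to compact features $\widehat{\mathcal{H}}$ obtained via Data Re-estimation. The sketch below shows a generic EM-attention loop of that kind in plain Python. It is an assumption-laden illustration, not the paper's Contextual EMA: the function name `em_attention`, the `iters` parameter, and the dot-product softmax assignment are hypothetical choices, and any normalization, temperature, the $\alpha$/$\beta$ weighting, and the cross-attention branch are omitted because this summary does not specify them.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def em_attention(H, M, iters=3):
    """Generic EM attention sketch (hypothetical helper).

    H: list of n feature vectors; M: list of K initial bases, same dimension.
    Returns the updated bases, the soft assignments Z, and the re-estimated
    (compact) features H_hat.
    """
    if iters < 1:
        raise ValueError("iters must be at least 1")
    dim = len(H[0])
    for _ in range(iters):
        # E-step: soft-assign every feature to the K bases.
        Z = [softmax([dot(h, m) for m in M]) for h in H]
        # M-step: each basis becomes the assignment-weighted mean of features.
        updated = []
        for k in range(len(M)):
            weight = sum(z[k] for z in Z)
            updated.append(
                [sum(z[k] * h[j] for z, h in zip(Z, H)) / weight for j in range(dim)]
            )
        M = updated
    # Data re-estimation: rebuild each feature as a mixture of the bases,
    # using the assignments from the last E-step.
    H_hat = [
        [sum(z[k] * M[k][j] for k in range(len(M))) for j in range(dim)] for z in Z
    ]
    return M, Z, H_hat


H = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
M0 = [[1.0, 0.0], [0.0, 1.0]]
M, Z, H_hat = em_attention(H, M0)
print(len(M), len(Z), len(H_hat))
```

Because every re-estimated feature is a convex combination of only $K$ bases, the output lives in a lower-rank space than the input, which is the sense in which such features are "compact".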