More than a Moment: Towards Coherent Sequences of Audio Descriptions

Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi

TL;DR

This work tackles the problem of incoherence and redundancy in automatic audio descriptions by introducing CoherentAD, a training-free pipeline that builds coherent AD sequences across video intervals. It generates multiple candidate ADs per interval from a structured interval narrative and selects a coherent sequence through auto-regressive scoring using four criteria: adherence to guidelines, redundancy reduction, story advancement, and comprehensive content counts. To evaluate sequence-level coherence, the authors propose StoryRecall and repetition metrics, showing that CoherentAD improves narrative fidelity and reduces repetition on CMD-AD, with competitive results on TV-AD and gains on ADQA. The approach demonstrates practical gains for producing narrative-aware ADs and provides a robust evaluation framework that goes beyond per-AD metrics, highlighting the limitations of traditional CIDEr-style measures for AD quality.

Abstract

Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
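The auto-regressive selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `select_sequence` and `toy_redundancy_scorer` are hypothetical, scores are simply summed, and the word-overlap scorer is a toy stand-in for the LLM-based scorers the paper uses.

```python
from typing import Callable, List, Sequence

# A scorer maps (candidate AD, previously selected ADs) to a numeric score.
# In the paper the scorers are LLM-based; this signature is an assumption.
Scorer = Callable[[str, Sequence[str]], float]


def toy_redundancy_scorer(candidate: str, history: Sequence[str]) -> float:
    """Toy stand-in: reward candidates whose words were not used before."""
    words = set(candidate.lower().split())
    seen = {w for ad in history for w in ad.lower().split()}
    if not words:
        return 0.0
    return 1.0 - len(words & seen) / len(words)


def select_sequence(
    candidates_per_interval: Sequence[Sequence[str]],
    scorers: Sequence[Scorer],
) -> List[str]:
    """Pick one candidate per interval, conditioning on earlier picks."""
    selected: List[str] = []
    for candidates in candidates_per_interval:
        # Sum the scores of all scorers (an assumed aggregation rule);
        # each scorer sees the ADs already chosen for earlier intervals.
        best = max(
            candidates,
            key=lambda c: sum(scorer(c, selected) for scorer in scorers),
        )
        selected.append(best)
    return selected


if __name__ == "__main__":
    candidates = [
        ["Doc adjusts the device."],
        ["Doc adjusts the device.", "Marty steps back."],
    ]
    print(select_sequence(candidates, [toy_redundancy_scorer]))
```

Because each choice depends on the ADs already selected, a candidate that merely repeats an earlier description scores lower than one that adds new information, which is the behaviour that independent per-interval generation lacks.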

Paper Structure

This paper contains 23 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Predicted ADs across the video (i.e. a sequence of AD intervals). The results reported by per-AD evaluation metrics are shown on the top right of each prediction (left: CIDEr; right: LLM-AD-Eval, score 1-5), with low scores indicating poor performance coloured in grey. The repetitions across predictions are highlighted in red, where "adjust the device" is repeated multiple times. The video is sampled from the movie Back to the Future, corresponding to 0:22-1:05, and can be watched here: https://www.youtube.com/watch?v=SR5BfQ4rEqQ&t=22s.
  • Figure 2: Overview of our multi-stage AD generation pipeline CoherentAD. For each AD interval, the VLM generates a structured description, which is then summarised. The summary is used to produce multiple candidate descriptions. Each candidate is scored by four independent LLM-based scorers that consider previous selections as context. The highest-scoring candidate is selected in an auto-regressive manner to form a coherent sequence.
  • Figure A.1: Qualitative comparison showing GT, our outputs, AutoAD-Zero [Xie24a] and Shot-by-Shot [shotbyshot], with repetitions highlighted in red.
  • Figure A.2: Qualitative comparison showing GT, our outputs, AutoAD-Zero [Xie24a] and Shot-by-Shot [shotbyshot], with repetitions highlighted in red.
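
The kind of repetition highlighted in these figures (e.g. "adjust the device" recurring across predictions) can be approximated with a simple n-gram check. The sketch below is a hypothetical proxy, not the repetition metric defined in the paper: the function names, the trigram default, and the tokenisation are all assumptions made for illustration.

```python
import re
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lower-cased word n-grams in a text."""
    tokens: List[str] = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def consecutive_repetition_rate(ads: Sequence[str], n: int = 3) -> float:
    """Fraction of consecutive AD pairs that share at least one n-gram.

    Hypothetical proxy for redundancy across consecutive ADs.
    """
    if len(ads) < 2:
        return 0.0
    repeated = sum(
        1 for prev, cur in zip(ads, ads[1:]) if ngrams(prev, n) & ngrams(cur, n)
    )
    return repeated / (len(ads) - 1)


if __name__ == "__main__":
    ads = [
        "Doc starts to adjust the device.",
        "He continues to adjust the device.",
        "Marty looks at his watch.",
    ]
    print(consecutive_repetition_rate(ads))  # one of two pairs overlaps
```

A lower rate indicates fewer phrases carried over from one AD to the next; a per-AD metric such as CIDEr cannot see this, because it scores each description in isolation.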