More than a Moment: Towards Coherent Sequences of Audio Descriptions
Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
TL;DR
This work tackles incoherence and redundancy in automatic audio descriptions (ADs) by introducing CoherentAD, a training-free pipeline that builds coherent AD sequences across video intervals. It generates multiple candidate ADs per interval from a structured interval narrative, then selects a coherent sequence through auto-regressive scoring on four criteria: adherence to guidelines, redundancy reduction, story advancement, and content comprehensiveness. To evaluate sequence-level coherence, the authors propose StoryRecall and repetition metrics, showing that CoherentAD improves narrative fidelity and reduces repetition on CMD-AD, with competitive results on TV-AD and gains on ADQA. The results show practical gains for narrative-aware AD generation, and the sequence-level evaluation goes beyond per-AD metrics, highlighting the limitations of traditional CIDEr-style measures for judging AD quality.
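The auto-regressive selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: `select_sequence` and `toy_score` are hypothetical names, and the scorer is a simple word-novelty heuristic standing in for the four-criteria scoring described above. The sketch also assumes every interval has at least one candidate.

```python
from typing import Callable, List, Sequence


def word_set(text: str) -> set:
    """Lower-cased whitespace tokens of a description."""
    return set(text.lower().split())


def toy_score(candidate: str, history: Sequence[str]) -> float:
    """Hypothetical stand-in scorer: fraction of the candidate's words
    that have not appeared in any previously selected AD."""
    words = word_set(candidate)
    if not words:
        return 0.0
    seen = set()
    for prev in history:
        seen |= word_set(prev)
    return len(words - seen) / len(words)


def select_sequence(
    candidates_per_interval: Sequence[Sequence[str]],
    score: Callable[[str, Sequence[str]], float] = toy_score,
) -> List[str]:
    """Pick one candidate per interval, left to right, scoring each
    candidate against the ADs already selected (auto-regressive)."""
    selected: List[str] = []
    for candidates in candidates_per_interval:
        # Assumes a non-empty candidate list; ties go to the first candidate.
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)
    return selected


if __name__ == "__main__":
    intervals = [
        ["A man walks in.", "A man enters the room."],
        ["A man walks in.", "He picks up a letter."],
    ]
    print(select_sequence(intervals))
```

Because each choice is conditioned on the ADs already selected, a candidate that repeats an earlier description scores lower than one that adds new information. Any scorer with the same `(candidate, history)` signature could be swapped in.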
Abstract
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
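As a rough illustration of what a repetition measure over consecutive AD outputs might look like, the Python sketch below computes the average share of each AD's n-grams that already occurred in the immediately preceding AD. The exact definition of the repetition metrics is not given here, so the n-gram overlap formulation and the names `ngrams` and `consecutive_repetition` are assumptions made for illustration only.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams from lower-cased whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def consecutive_repetition(ads: Sequence[str], n: int = 2) -> float:
    """Assumed metric: mean fraction of each AD's n-grams that also
    appear in the AD directly before it (0 = no repetition)."""
    rates: List[float] = []
    for prev, curr in zip(ads, ads[1:]):
        curr_ngrams = ngrams(curr, n)
        if not curr_ngrams:
            continue  # AD too short to form an n-gram; skip the pair
        overlap = curr_ngrams & ngrams(prev, n)
        rates.append(len(overlap) / len(curr_ngrams))
    return sum(rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    sequence = ["he opens the door", "he opens the window"]
    print(consecutive_repetition(sequence))
```

A lower value means consecutive descriptions share less wording; a measure of this kind complements a narrative-level score such as StoryRecall, since it looks only at surface redundancy.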
