What You See is What You Ask: Evaluating Audio Descriptions
Divy Kala, Eshika Khandelwal, Makarand Tapaswi
TL;DR
The paper introduces ADQA, a QA-based benchmark that evaluates automatic audio descriptions (ADs) on few-minute video segments, with the goal of better supporting blind and low-vision viewers. By aligning two independent AD tracks for the same movies, it demonstrates that ADs are substantially subjective, and that evaluating on trimmed clips against a single reference is therefore inadequate. ADQA jointly assesses visual appreciation and narrative understanding through LLM-based question answering. The results show that human-authored ADs still outperform current generation methods, pointing to the need for longer-context, narrative-aware AD models. The work also provides a public leaderboard, an analysis of AD generation methods, and practical recommendations for producing more coherent, informative descriptions that support both visual engagement and story comprehension.
Abstract
Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute-long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
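
To make the idea of QA-based evaluation concrete, the following is a minimal sketch of how an AD track for one segment could be scored separately on VA and NU questions. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the multiple-choice format, the `Question` type, the `score_ad` function, and the `keyword_answerer` (a toy word-overlap stand-in for whatever LLM answers the questions) are all hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Question(NamedTuple):
    """Hypothetical multiple-choice question about a video segment."""
    category: str       # "VA" (visual appreciation) or "NU" (narrative understanding)
    text: str
    options: List[str]
    answer_index: int   # index of the correct option


def keyword_answerer(ad_text: str, question: Question) -> int:
    """Toy stand-in for an LLM: pick the option sharing the most words with the AD."""
    ad_words = set(ad_text.lower().split())
    overlaps = [len(ad_words & set(opt.lower().split())) for opt in question.options]
    return overlaps.index(max(overlaps))  # ties resolve to the first option


def score_ad(
    ad_text: str,
    questions: List[Question],
    answerer: Callable[[str, Question], int],
) -> Dict[str, float]:
    """Return per-category accuracy when questions are answered from the AD text alone."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        total[q.category] += 1
        if answerer(ad_text, q) == q.answer_index:
            correct[q.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

Under this sketch, comparing a generated AD track with a human-authored one amounts to calling `score_ad` on each with the same question set and comparing the resulting VA and NU accuracies. Because the score depends on whether the AD conveys the needed information rather than on its wording, two differently phrased but equally informative ADs can score the same.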
