What You See is What You Ask: Evaluating Audio Descriptions

Divy Kala, Eshika Khandelwal, Makarand Tapaswi

TL;DR

The paper introduces ADQA, a QA-based benchmark for evaluating automatic audio descriptions (ADs) on few-minute video segments, with the goal of better supporting blind and low-vision viewers. By aligning two independent AD tracks for the same movies, it demonstrates substantial subjectivity in ADs and shows that trimmed-clip evaluation against a single reference is inadequate. ADQA jointly assesses visual appreciation and narrative understanding via LLM-based QA, revealing that current generation methods lag far behind human-authored ADs and highlighting the need for longer-context, narrative-aware AD models. The work provides a public leaderboard, an analysis of AD generation methods, and recommendations for future AD research.

Abstract

Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.

Paper Structure

This paper contains 58 sections, 1 equation, 9 figures, 9 tables.

Figures (9)

  • Figure 1: We present ADQA's question generation and answering framework. A small part of a video from the film Liar Liar from CMD-AD [han2024autoad3] is shown. AD Track 1 and 2 from AudioVault show the dialogs and ADs describing the video in different ways. The plot summary is taken from CMD [bain2020condensedmovies]. AD Track 1 is used to create Visual Appreciation questions, whereas the plot summary is used to create Narrative Understanding questions. LLMs are prompted to answer both question types using the AD track under evaluation, here, AD Track 2. The video can be watched here: https://youtu.be/IsBB4i4k2PM.
  • Figure 2: Impact of overlap threshold on AD alignment on the two-track subset of CMD-AD movies. As expected, the fraction (%) of non-aligned ADs increases with the threshold. Interestingly, even at low thresholds, 25-30% of ADs are not aligned, indicating that many ADs in one track are not present in the other. Additionally, CIDEr does not increase with better temporal overlap (high threshold), suggesting that even well-aligned ADs often differ substantially in wording.
  • Figure 3: BERT Similarity (B) vs. CIDEr (C) for time-aligned ADs from two AD tracks on 17 movies from the CMD-AD dataset. The quadrants and $\uparrow$ or $\downarrow$ labels are separated by median scores (B: 86.2, C: 3.1) and the proportion of samples in each quadrant is in P %. We summarize the reasons for these scores in the table.
  • Figure 4: Prompt to classify transcriptions into "dialogue" or "AD".
  • Figure 5: Prompt used to align the plot synopsis sentences with a dialog + AD movie "script" (not the real script).
  • ...and 4 more figures
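
The threshold-based alignment described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that overlap is measured as temporal intersection-over-union (IoU) and that matching is greedy and one-to-one; the excerpt above does not state the paper's actual overlap measure or matching procedure, so the `AD` class and the functions `temporal_iou`, `align_tracks`, and `non_aligned_fraction` are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AD:
    """One audio description with start/end times in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def temporal_iou(a: AD, b: AD) -> float:
    """Temporal intersection-over-union of two ADs (assumed overlap measure)."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def align_tracks(track1, track2, threshold):
    """Greedily pair each AD in track1 with its best unused AD in track2.

    A pair is kept only if its IoU is positive and at least `threshold`.
    Returns a list of (index_in_track1, index_in_track2) tuples.
    """
    pairs, used = [], set()
    for i, a in enumerate(track1):
        best_j, best_iou = None, 0.0
        for j, b in enumerate(track2):
            if j in used:
                continue
            score = temporal_iou(a, b)
            if score >= threshold and score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def non_aligned_fraction(track1, pairs):
    """Fraction of track1 ADs that found no partner in the other track."""
    return 1.0 - len(pairs) / len(track1)


# Toy data, invented for this example only.
track1 = [AD(0, 4, "A man runs."), AD(10, 12, "He stops."), AD(20, 25, "A car arrives.")]
track2 = [AD(1, 5, "He sprints down the hall."), AD(11, 15, "He halts."), AD(40, 42, "Night falls.")]

for thr in (0.1, 0.5):
    pairs = align_tracks(track1, track2, thr)
    print(thr, pairs, round(non_aligned_fraction(track1, pairs), 2))
```

On this toy data, raising the threshold drops the loosely overlapping second pair, so the non-aligned fraction grows with the threshold, which is the trend the caption reports. Comparing the text of the aligned pairs (for example with CIDEr, as in the figure) would be a separate step not shown here.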