MCAD: Multimodal Context-Aware Audio Description Generation For Soccer

Lipisha Chaudhary; Trisha Mittal; Subhadra Gopalakrishnan; Ifeoma Nwogu; Jaclyn Pytlarz

TL;DR

MCAD tackles automated audio description (AD) for soccer by using a Video-LLM fine-tuned on movie AD data and enriching generation with multimodal context from commentary, players, and actions. It introduces ARGE-AD, a reference-free metric built on AD guidelines to evaluate AD quality across domains, and validates it on both movie and soccer datasets. The approach segments soccer games, retrieves contextual cues, and prompts a context-aware Video-LLM to produce an AD for each clip, achieving competitive results and providing expert-annotated soccer AD data. The work highlights practical potential for scalable, domain-agnostic AD generation and identifies avenues for improving factual accuracy, timing, and personalization. The results suggest MCAD can extend accessible AD coverage to live sports and other dynamic multimedia without relying on ground-truth AD pairs.

Abstract

Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent work has taken promising steps toward automating AD, but it has been limited to describing high-quality movie content and relies on human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model (Video-LLM) on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned Video-LLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people's names, (ii) mention of actions and events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap with commentary or subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the use of this metric for quantitatively assessing the quality of generated AD across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts.
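
To make the five ARGE-AD characteristics concrete, the following is a toy sketch of how each one might be checked with simple string heuristics. It is not the authors' implementation: the abstract does not state how ARGE-AD detects, scores, or weights these characteristics. The function name `check_ad`, the pronoun list, the action keywords, and the `max_words` threshold are all hypothetical choices made for illustration. The commentary overlap is reported as a raw fraction, with no judgment about what value is desirable.

```python
import re

# Hypothetical pronoun list; not taken from the paper.
PRONOUNS = {"he", "she", "they", "him", "her", "them", "his", "their", "it", "its"}


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def check_ad(ad, names, actions, commentary, max_words=20):
    """Run five illustrative checks on one generated AD sentence.

    names, actions: lists of strings supplied by the caller (assumed inputs).
    max_words: hypothetical length limit, not a value from the paper.
    """
    words = tokenize(ad)
    ad_lower = ad.lower()
    commentary_words = set(tokenize(commentary))
    # Fraction of AD tokens that also occur in the commentary.
    overlap = sum(w in commentary_words for w in words) / len(words) if words else 0.0
    return {
        "uses_names": any(n.lower() in ad_lower for n in names),        # (i)
        "mentions_action": any(a.lower() in ad_lower for a in actions),  # (ii)
        "length_ok": 0 < len(words) <= max_words,                        # (iii)
        "pronoun_free": not any(w in PRONOUNS for w in words),           # (iv)
        "commentary_overlap": overlap,                                   # (v)
    }


if __name__ == "__main__":
    report = check_ad(
        ad="Messi passes the ball to Alba near the box.",
        names=["Messi", "Alba"],
        actions=["pass", "shot", "goal"],
        commentary="Great ball from Messi, Alba is making the run.",
    )
    for key, value in report.items():
        print(key, value)
```

A learned or guideline-driven metric would likely replace the keyword matching with named-entity recognition and action detection; the sketch only shows how the five characteristics could map to separate, inspectable checks.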
