MCAD: Multimodal Context-Aware Audio Description Generation For Soccer

Lipisha Chaudhary, Trisha Mittal, Subhadra Gopalakrishnan, Ifeoma Nwogu, Jaclyn Pytlarz

TL;DR

MCAD tackles automated audio description (AD) for soccer by using a Video-LLM fine-tuned on movie AD data and enriching generation with multimodal context from commentary, players, and actions. It introduces ARGE-AD, a reference-free metric built on AD guidelines to evaluate AD quality across domains, and validates it on both movie and soccer datasets. The approach segments soccer games into clips, retrieves contextual cues for each clip, and prompts a context-aware Video-LLM to produce one AD per clip, achieving competitive results and providing expert-annotated soccer AD data. The work highlights practical potential for scalable, domain-agnostic AD generation and identifies avenues for improving factual accuracy, timing, and personalization. The results suggest MCAD can extend accessible AD coverage to live sports and other dynamic multimedia without relying on ground-truth AD pairs.

Abstract

Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent work has taken promising steps towards automating AD, but it has been limited to describing high-quality movie content and has relied on human-annotated ground-truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground-truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model (Video-LLM) on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned Video-LLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people's names, (ii) mention of actions and events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary or subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate ARGE-AD as a quantitative measure of generated AD quality across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts.
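
The five ARGE-AD characteristics listed in the abstract can be illustrated with a small heuristic checker. This is only a sketch under stated assumptions, not the paper's metric: the function name `characteristic_report`, the pronoun list, the word-count bounds, and the token-overlap ratio are all hypothetical choices made for illustration, and the sketch deliberately reports the commentary overlap as a raw ratio without deciding how it should be scored.

```python
import re

# Hypothetical pronoun list; the paper's actual list is not given here.
PRONOUNS = {"he", "she", "they", "him", "her", "them",
            "his", "hers", "their", "it", "its"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def characteristic_report(ad, names, actions, commentary,
                          min_words=5, max_words=30):
    """Check one generated AD against five guideline-style characteristics.

    min_words and max_words are assumed bounds, chosen only for illustration.
    """
    tokens = tokenize(ad)
    lowered = ad.lower()
    commentary_tokens = set(tokenize(commentary))
    # Fraction of AD tokens that also occur in the commentary.
    overlap = (sum(t in commentary_tokens for t in tokens) / len(tokens)
               if tokens else 0.0)
    return {
        "uses_names": any(n.lower() in lowered for n in names),
        "mentions_action": any(a.lower() in lowered for a in actions),
        "length_ok": min_words <= len(tokens) <= max_words,
        "no_pronouns": not any(t in PRONOUNS for t in tokens),
        "commentary_overlap": overlap,
    }


report = characteristic_report(
    ad="Ronaldo shoots from the edge of the box and scores.",
    names=["Ronaldo"],
    actions=["shoots", "scores"],
    commentary="What a strike from Ronaldo!",
)
print(report)
```

Because the checks need only the generated text and the retrieved context, a report like this can be computed without any ground-truth AD, which is the sense in which the metric is reference-free.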

Paper Structure

This paper contains 35 sections, 2 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Autogenerating Audio Descriptions (ADs) For Soccer Games: We propose $\textbf{MCAD}$, a framework to generate AD for domains beyond movies, with a focus on soccer games as a domain. $\textbf{MCAD}$ enriches the generated AD by capturing all context cues like team, league, player names, actions, and commentary. We also propose $\textit{ARGE-AD}$, a reference-free metric based on AD conventions to evaluate the generated ADs.
  • Figure 2: $\textbf{MCAD}$ for Soccer Games: We present an overview of our framework for using $\textbf{MCAD}$ to generate AD for sports. Step $(i)$ is the fine-tuning step, where we leverage the large amount of available ground-truth AD data for movie clips to develop AD-VidLlama2, a Video-LLM enriched with AD aspects. Steps $(ii)$, $(iii)$, and $(iv)$ show how the fine-tuned AD-VidLlama2 is used at inference time to generate ADs for sports game clips. In $(ii)$, we take an entire game video $\mathcal{I}$ and use scene detection to divide it into smaller game clips $\mathcal{I}_1 \dots \mathcal{I}_i \dots \mathcal{I}_N$. In $(iii)$, we retrieve contextual cues for a particular game clip $\mathcal{I}_i$: the corresponding commentary text $c_i$, player names $p_i^k$, and actions $a_i^k$. Finally, in $(iv)$, we combine the retrieved context in the prompt $\mathcal{P}$ with the input video clip $\mathcal{I}_i$ to generate the AD $\widehat{\mathcal{Y}}_i$.
  • Figure 3: Qualitative Results: We show qualitative visualizations for CMD-AD (top) and SoccerNet (bottom). The CMD-AD examples are from "The Man Who Wasn't There" (top left), "Back to the Future" (top middle), and "Much Ado About Nothing" (top right). The SoccerNet-S examples are from Real Madrid vs Betis [Spain LaLiga (2015-2016)] (bottom left, scene length 15 s) and AC Milan vs Empoli [Italy Serie A (2015-2016)] (bottom right, scene length 30 s).
  • Figure 4: Additional Qualitative Results: We show qualitative visualizations for an NBA game (left) and street navigation (right).
  • Figure 5: Prompting in $\textbf{MCAD}$: We present the three prompt variants used to evaluate $\textbf{MCAD}$. We also depict how contextual cues are provided along with the prompts.
  • ...and 8 more figures
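
The context-retrieval and prompting steps described in the Figure 2 caption (steps (iii) and (iv)) can be sketched as a small prompt-assembly routine. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ClipContext` container, the `build_prompt` function, the instruction wording, and the prompt layout are hypothetical, and the call to the fine-tuned Video-LLM itself is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClipContext:
    """Hypothetical container for the cues retrieved for one clip I_i."""
    commentary: str = ""                               # c_i
    players: List[str] = field(default_factory=list)   # p_i^k
    actions: List[str] = field(default_factory=list)   # a_i^k


def build_prompt(ctx, instruction="Describe the clip as an audio description."):
    """Combine an instruction with whichever context cues are available."""
    parts = [instruction]
    if ctx.players:
        parts.append("Players: " + ", ".join(ctx.players))
    if ctx.actions:
        parts.append("Actions: " + ", ".join(ctx.actions))
    if ctx.commentary:
        parts.append("Commentary: " + ctx.commentary)
    return "\n".join(parts)


ctx = ClipContext(
    commentary="Great ball into the box.",
    players=["Player A", "Player B"],
    actions=["cross", "header"],
)
# The resulting prompt would be passed, together with the clip,
# to the Video-LLM (that call is outside the scope of this sketch).
print(build_prompt(ctx))
```

Skipping empty cues keeps the prompt well-formed for clips where, for example, no commentary or no player names could be retrieved.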