SwissADT: An Audio Description Translation System for Swiss Languages

Lukas Fischer, Yingqiang Gao, Alexa Lintner, Sarah Ebling

TL;DR

SwissADT is the first audio description translation (ADT) system implemented for three main Swiss languages and English. The authors believe that combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.

Abstract

Audio description (AD) is a crucial accessibility service provided to blind persons and persons with visual impairment, designed to convey visual information in acoustic form. Despite recent advancements in multilingual machine translation research, the lack of well-crafted and time-synchronized AD data impedes the development of audio description translation (ADT) systems that address the needs of multilingual countries such as Switzerland. Furthermore, since the majority of ADT systems rely solely on text, uncertainty exists as to whether incorporating visual information from the corresponding video clips can enhance the quality of ADT outputs. In this work, we present SwissADT, the first ADT system implemented for three main Swiss languages and English. By collecting well-crafted AD data augmented with video clips in German, French, Italian, and English, and leveraging the power of Large Language Models (LLMs), we aim to enhance information accessibility for diverse language populations in Switzerland by automatically translating AD scripts to the desired Swiss language. Our extensive experimental ADT results, composed of both automatic and human evaluations of ADT quality, demonstrate the promising capability of SwissADT for the ADT task. We believe that combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.

Paper Structure

This paper contains 26 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) Overview of SwissADT: An end-to-end pipeline that translates a given AD segment from English to the three main languages of Switzerland with the most salient video frames; (b) Detail of the moment retriever: it selects a moment, i.e., the most salient sequence of consecutive frames, to augment the translation inputs; (c) Detail of the frame sampler: it linearly interpolates the retrieved moment to obtain a cascade of frames used as inputs to the AD translator. In our implementation, we choose LLMs (GPT-4 models) as the AD translator due to their superior capabilities for performing multilingual machine translation tasks.
  • Figure 2: An example of a German AD script with spoken subtitles and special characters used in our data schema. The presence of a dollar sign ($) signifies a constrained timeframe of faster pace of speech. An asterisk sign (*) indicates a scene change within the script. Spoken subtitles are marked by UT as an abbreviation for "Untertitel" in German.
  • Figure 3: Two examples of ambiguity that require additional context for resolution. The words that are correctly disambiguated by the visual input are highlighted in bold.
  • Figure 4: User interaction interface for SwissADT. We use Streamlit and Docker to implement the user interaction platform.
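
The frame sampler described in Figure 1(c) linearly interpolates across the retrieved moment to obtain the frames passed to the AD translator. The following is a minimal sketch of that idea, not the authors' implementation: the function name `sample_frame_indices` and its parameters (`start_s`, `end_s`, `fps`, `num_frames`) are hypothetical, and it assumes the moment is given as start and end timestamps in seconds.

```python
from typing import List


def sample_frame_indices(
    start_s: float, end_s: float, fps: float, num_frames: int
) -> List[int]:
    """Return frame indices evenly spaced over the moment [start_s, end_s].

    All names and the timestamp-based interface are assumptions made for
    illustration; the source only states that the moment is linearly
    interpolated to obtain a sequence of frames.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if end_s < start_s:
        raise ValueError("end_s must not precede start_s")

    # Convert the moment boundaries from seconds to frame indices.
    first = round(start_s * fps)
    last = round(end_s * fps)

    # A single frame is taken from the middle of the moment.
    if num_frames == 1:
        return [(first + last) // 2]

    # Linear interpolation between the first and last frame index.
    step = (last - first) / (num_frames - 1)
    return [round(first + i * step) for i in range(num_frames)]


if __name__ == "__main__":
    # Five frames from a moment spanning seconds 2 to 4 of a 24 fps clip.
    print(sample_frame_indices(2.0, 4.0, 24.0, 5))
```

Sampling by index keeps the selection independent of any particular video library; the resulting indices could then be used to read the corresponding frames with whatever decoder the pipeline already relies on.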