Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

Nadeen Fathallah, Monika Bhole, Steffen Staab

TL;DR

The paper tackles the DHH accessibility challenge posed by imperfect video captions by introducing an LLM-based correction pipeline that refines ASR captions generated by YouTube. It systematically compares GPT-3.5 and Llama2-13B, demonstrates a substantial reduction in Word Error Rate from $23.07\%$ to $9.75\%$ with ChatGPT-3.5, and reports strong improvements in BLEU and ROUGE metrics, signaling increased caption accuracy and coherence. A new 52-video dataset reflecting real-world captioning challenges is introduced, with manual ground-truth captions serving as the benchmark. Limitations include sensitivity to intonation and cultural references, motivating future work on multi-modal LLMs, code-switching, and deployment on other platforms and in low-resource settings to broaden accessibility.

Abstract

In today's digital age, video content is prevalent, serving as a primary source of information, education, and entertainment. However, the Deaf and Hard of Hearing (DHH) community often faces significant challenges in accessing video content because automatic speech recognition (ASR) systems frequently fail to provide accurate and reliable captions. This paper addresses the urgent need to improve video caption quality by leveraging Large Language Models (LLMs). We present a comprehensive study that explores the integration of LLMs to enhance the accuracy and context-awareness of captions generated by ASR systems. Our methodology involves a novel pipeline that corrects ASR-generated captions using advanced LLMs. It focuses on GPT-3.5 and Llama2-13B because of their robust performance in language comprehension and generation tasks. To evaluate the proposed pipeline, we introduce a dataset representative of the real-world challenges the DHH community faces. Our results indicate that LLM-enhanced captions significantly improve accuracy, as evidenced by the notably lower Word Error Rate (WER) achieved by ChatGPT-3.5 (WER: 9.75%) compared to the original ASR captions (WER: 23.07%). This corresponds to an approximate 57.72% relative improvement in WER over the original ASR captions.
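To make the reported metric concrete, the following is a minimal sketch of how WER and the relative improvement can be computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_improvement`, the whitespace tokenization, and the lowercasing step are assumptions made for illustration. The sample sentence reuses the "business acen" / "business acumen" error from Figure 1, embedded in a made-up sentence.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace and lowercased; the paper's
    actual text normalization is not specified here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fraction by which `improved` reduces `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical sentence built around the Figure 1(c) error.
    ref = "he has strong business acumen"
    asr = "he has strong business acen"
    print(word_error_rate(ref, asr))  # one substitution out of five words

    # Rounded WER values from the abstract, in percent.
    print(round(relative_improvement(23.07, 9.75) * 100, 1))
```

With the rounded values from the abstract, the relative reduction comes out at roughly 57.7%, in line with the reported 57.72%; the small gap is presumably due to rounding of the published WER figures.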

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Examples of inaccurate captions produced by a traditional ASR system (YouTube's automatic video captioning feature). In (a) the ASR system generates "Koreans canin" instead of "Koreans can", in (b) "triy" instead of "trial", and in (c) "business acen" instead of "business acumen".
  • Figure 2: Selection process of Large Language Models (LLMs) for improving the quality of captions generated by an ASR system. The figure illustrates the performance of different LLMs (GPT-2, T5, Llama2-13B, GPT-3.5) in correcting a sample video caption while maintaining the original word sequence.
  • Figure 3: Pipeline - Our proposed caption correction pipeline leveraging LLMs. The input is the ASR system-generated caption (text), shown on the left, which includes errors highlighted in red. The output is the LLM-corrected caption (text), shown on the right, where the corrections are highlighted in green.
  • Figure 4: Preliminary experiments to evaluate LLMs' performance on code-switching challenges in ASR video captions demonstrate that Llama-2-13B and ChatGPT-3.5 successfully corrected the caption, showcasing their ability to handle mixed-language inputs and produce accurate English captions.