SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara

TL;DR

SlideAVSR addresses the need to evaluate image-text understanding in AVSR beyond lip reading by constructing a dataset of scientific paper explanation videos. It introduces DocWhisper, an OCR-enabled AVSR approach that passes slide text as prompts to a Whisper model, and an FQ Ranker that mitigates long-tail OCR outputs. Experiments show that DocWhisper yields up to 14.3% absolute improvement on TestA and 11% on TestB over audio-only Whisper, demonstrating the practical value of slide-text context. Dataset construction uses multi-stage filtering, cleansing, and accent-aware partitioning to produce roughly 36 hours of data from 245 videos. The paper also discusses limitations and directions toward end-to-end AVSR and broader video domains.

Abstract

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable effort has been directed at datasets for facial features such as lip reading, but these datasets often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with the help of text on the slides shown in presentation recordings. As technical terms, which are frequent in paper explanations, are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.
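
The idea of handing slide text to the recognizer as a prompt can be illustrated with a minimal sketch. The text here does not specify how DocWhisper or the FQ Ranker work internally, so the ranking rule below (prefer words that are rare in a reference corpus, then truncate) is an assumption, and the names `build_slide_prompt`, `reference_counts`, and `max_words` are hypothetical.

```python
from collections import Counter


def build_slide_prompt(ocr_words, reference_counts, max_words=100):
    """Build a text prompt from OCR words, keeping the rarest words first.

    ASSUMPTION: ranking by rarity in a reference corpus is only one plausible
    way to shorten long OCR outputs; it is not taken from the paper.
    """
    # Drop case-insensitive duplicates while preserving slide order.
    seen = set()
    unique = []
    for word in ocr_words:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            unique.append(word)

    # sorted() is stable, so words with equal counts keep their slide order.
    ranked = sorted(unique, key=lambda w: reference_counts.get(w.lower(), 0))

    # Truncate so that very long OCR outputs still fit in a short prompt.
    return " ".join(ranked[:max_words])


# Toy reference corpus standing in for a large general-domain word count.
reference_counts = Counter("the model is the best and the data is good".split())
ocr_words = ["The", "DocWhisper", "model", "uses", "OCR", "the"]
prompt = build_slide_prompt(ocr_words, reference_counts, max_words=3)
```

With the open-source `openai-whisper` package, a string like `prompt` could be supplied through the `initial_prompt` argument of `transcribe`. Whether DocWhisper conditions the model in exactly this way is not stated here.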

Paper Structure (28 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Construction flow of SlideAVSR.
  • Figure 2: Frequency distribution of the number of words in OCR results. While samples with over 500 words are present, they are omitted for brevity.
  • ...and 4 more figures