SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara
TL;DR
SlideAVSR addresses the need to evaluate image-text understanding in AVSR beyond lip-reading by constructing a dataset of scientific paper explanation videos. It introduces DocWhisper, an OCR-enabled AVSR approach that uses slide text as prompts to a Whisper model, and an FQ Ranker to mitigate long-tail OCR outputs. Experiments show DocWhisper yields up to 14.3% absolute improvement on TestA and 11% on TestB over audio-only Whisper, demonstrating the practical value of slide-text context. The dataset is built with multi-stage filtering, cleansing, and accent-aware partitioning, producing roughly 36 hours of data from 245 videos. The paper also discusses limitations and directions toward end-to-end AVSR and broader video domains.
Abstract
Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable effort has been directed at datasets for facial features such as lip reading, but these datasets often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with reference to the text on slides in presentation recordings. As technical terms that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.
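To make the idea of prompting with slide text concrete, the following is a minimal sketch of how OCR words might be ranked by frequency and trimmed into a short prompt. It is not the authors' implementation. The function name `build_prompt`, the toy `background_freq` table, the `max_words` limit, and the choice to place rarer words first are all assumptions made for illustration.

```python
from collections import Counter


def build_prompt(ocr_words, background_freq, max_words=5):
    """Return a prompt of at most max_words unique OCR words, rarest first.

    background_freq maps lowercase words to counts in some reference corpus
    (hypothetical here); words missing from it are treated as rarest.
    """
    seen = set()
    unique = []
    for word in ocr_words:
        key = word.lower()
        if key not in seen:  # drop case-insensitive duplicates
            seen.add(key)
            unique.append(word)
    # sorted() is stable, so words with equal frequency keep slide order.
    ranked = sorted(unique, key=lambda w: background_freq.get(w.lower(), 0))
    return ", ".join(ranked[:max_words])


# Toy reference counts (assumed values, not taken from the paper).
background_freq = Counter({"the": 1000, "of": 800, "model": 50, "speech": 20})
ocr_words = ["The", "Whisper", "model", "of", "speech", "the", "OCR"]

prompt = build_prompt(ocr_words, background_freq, max_words=4)
print(prompt)
```

A string like this could then be supplied as textual context to a Whisper model, for example through the `initial_prompt` argument of `transcribe` in the openai-whisper package. Whether DocWhisper passes prompts in exactly this way is not stated in the text above.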
