Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

Jinghua Zhao, Yuhang Jia, Shiyao Wang, Jiaming Zhou, Hui Wang, Yong Qin

TL;DR

This paper addresses a limitation of existing AVSR approaches, which rely on either lip-reading or contextual video alone, by introducing Chinese-LiPS, a 100-hour Chinese AVSR dataset that includes both lip-reading video and presentation slides. Building on this dataset, the authors propose LiPS-AVSR, a pipeline that fuses lip-reading features, OCR-derived slide text, and semantic cues extracted from slides (via InternVL2) with audio features within the Whisper framework, using gated cross-attention and prompt-based guidance. Empirically, lip-reading gives a relative CER reduction of about 8%, slides give about 25%, and combining all modalities gives about 35%, with a best CER of 2.58% on Chinese-LiPS. These findings highlight the complementary roles of articulation cues and semantic slide content for robust, domain-specific AVSR in educational contexts. The dataset is publicly available at https://kiri0824.github.io/Chinese-LiPS/

Abstract

Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/
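
The improvements quoted above are relative reductions in character error rate (CER), the usual metric for Chinese ASR. As a minimal sketch of how such figures are typically computed (not the authors' evaluation code), the Python below implements CER as character-level edit distance divided by reference length, plus a relative-improvement helper. The function names, the toy sentences, and the two CER values passed to `relative_improvement` are illustrative assumptions and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example (hypothetical sentences): one substituted character out of six.
    reference = "语音识别系统"
    hypothesis = "语音识别细统"
    print(f"CER: {cer(reference, hypothesis):.2%}")

    # Hypothetical error rates, chosen only to illustrate the formula.
    print(f"Relative improvement: {relative_improvement(0.040, 0.026):.1%}")
```

A relative improvement is always stated against a baseline error rate, so a figure such as "about 35%" describes the proportional drop in CER, not a drop of 35 absolute percentage points.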

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Overview of slide styles and themes across different topics in the Chinese-LiPS dataset: we display examples from eight specific topics and the content included in each, while the 'Others' topic covers diverse topics such as dance, fashion, cuisine, photography, etc.
  • Figure 2: Distribution of total recording duration by topic.
  • Figure 3: Distribution analysis of Chinese-LiPS dataset.
  • Figure 4: LiPS-AVSR pipeline.
  • Figure 5: Error correction examples using slide and lip-reading information. (A) Lip-reading mitigates hesitation and filler errors, while slide data addresses domain-specific terms. (B) OCR fails to capture visual cues, but InternVL2 effectively extracts meaningful context.