Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

He Wang; Pengcheng Guo; Xucheng Wan; Huan Zhou; Lei Xie

TL;DR

This work targets lip-reading in real-world Chinese settings by combining multi-scale lip video data with multiple visual encoders. It introduces an Enhanced ResNet3D visual front-end, adds Branchformer and E-Branchformer to the mainstream Transformer and Conformer encoders, and fuses the transcripts of all systems with recognizer output voting error reduction (ROVER). On the ICME 2024 ChatCLR Challenge Task 2 evaluation set, the approach reduces the character error rate (CER) by 21.52% compared to the official baseline and ranks second, which supports the benefit of scale-diverse data and encoder diversity. The approach demonstrates the value of flexible lip-region extraction, 3D visual feature modeling, and transcript-level ensembling for robust visual speech recognition.

Abstract

Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches use only a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multiple encoders. Specifically, we first propose a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the encoders, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
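Since the reported result is expressed in CER, a minimal sketch of how a CER value is typically computed may help: the character-level edit distance between hypothesis and reference, divided by the reference length. The function names and the toy sentences are illustrative assumptions, not taken from the paper, and the challenge's official scoring script may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy example (hypothetical): one substitution and one deletion
    # against a six-character reference, so 2 edits / 6 characters.
    print(cer("今天天气很好", "今天天汽好"))
```

Because insertions count as errors, CER can exceed 100% when a hypothesis is much longer than its reference.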

Paper Structure (14 sections, 3 equations, 3 figures, 2 tables)

Introduction
Method
Multi-Scale Lip Video Data Extraction
Enhanced ResNet3D Visual Front-end
Multi-System Building and Fusion
Experiment
Data Processing
Implementation Details
Main Results and Analysis
Which visual encoder performs best?
Which data scale is most suitable?
How much gain does multi-system fusion bring?
Ablation Study
Conclusion

Figures (3)

Figure 1: Examples of multi-scale lip motion videos of speaker S217 (top) and S443 (bottom) from the ChatCLR training set.
Figure 2: Detailed structures of the proposed Enhanced ResNet3D visual front-end (a) and its basic block (b).
Figure 3: Block diagram of the proposed multi-system fusion approach for automatic lip-reading.
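To make the transcript-level fusion idea behind Figure 3 concrete, here is a deliberately simplified voting sketch. It is not the NIST ROVER algorithm and not the authors' implementation: instead of building a full word transition network, it aligns every hypothesis to the first one (an assumed "pivot"), takes a per-character majority vote, and drops characters that other systems insert relative to the pivot. All names and the toy transcripts are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot: str, hyp: str) -> list[str]:
    """For each pivot position, return the aligned hyp character ("" if none)."""
    slots = [""] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair characters position by position; unmatched pivot slots stay "".
            for k in range(min(i2 - i1, j2 - j1)):
                slots[i1 + k] = hyp[j1 + k]
        # "delete": pivot characters without a counterpart keep "".
        # "insert": extra hyp characters are dropped (simplification).
    return slots


def vote(hypotheses: list[str]) -> str:
    """Per-position majority vote over hypotheses aligned to the first one."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    pivot = hypotheses[0]
    columns = [[ch] for ch in pivot]
    for hyp in hypotheses[1:]:
        for column, ch in zip(columns, align_to_pivot(pivot, hyp)):
            column.append(ch)
    # Ties go to the candidate seen first, i.e. the earlier system.
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


if __name__ == "__main__":
    # Three toy system outputs, each with at most one character error.
    systems = ["今天天汽很好", "今天天气很好", "今天天气很号"]
    print(vote(systems))
```

In this toy case each system makes a different mistake, so the vote recovers a transcript that none of the erroneous systems produced alone, which is the intuition for why diverse encoders and data scales can help fusion.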
