Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment

Lei Yang, Yi He, Fei Wu, Shilin Wang

Abstract

Mandarin Chinese visual speech recognition (VSR) has advanced in recent years, yet its performance still lags behind that on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to improve recognition accuracy. Although this helps, in cascaded designs each stage depends at inference time on the output of the preceding stage, which leads to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phonemes and visemes, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference. This provides a trade-off between inference efficiency and performance while mitigating the error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.

Paper Structure (14 sections, 11 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Error accumulation in cascade structures during inference. Errors in the predicted pinyin affect the prediction of characters, even when visual features are supplied through a shortcut connection or cross-level attention.
  • Figure 2: Comparison of multi-stage methods and the cascade-free method. 1) Multi-stage method; 2) multi-stage method with shortcut connection; 3) cascade-free method.
  • Figure 3: Overall architecture of the proposed method. 1) The training pipeline, where F denotes feed-forward, C convolution, A self-attention, and X cross-attention. 2) The inference pipeline, in which the phoneme and viseme encoders can be run on demand for recognition performance and interpretability. 3) The semantic-guided local contrastive loss, which temporally aligns the intermediate representations under semantic guidance.
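
The idea of a local contrastive loss that aligns two feature sequences in time, as mentioned in the abstract and in the Figure 3 caption, can be illustrated with a small sketch. This is not the loss defined in the paper, whose exact form is not given on this page. It is a generic InfoNCE-style objective under stated assumptions: both sequences have the same length, the positive for frame t is the guide feature at the same frame, and the negatives are the guide features within a local window around t. The function names, the `window` and `temperature` parameters, and the use of cosine similarity are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def local_contrastive_loss(anchor_seq, guide_seq, window=2, temperature=0.1):
    """Illustrative InfoNCE over a local temporal window (an assumption,
    not the paper's formulation).

    anchor_seq: per-frame features from an intermediate branch
                (hypothetically, a phoneme or viseme encoder).
    guide_seq:  per-frame features acting as the semantic guide.
    For frame t, guide_seq[t] is the positive; the other guide frames
    within `window` steps of t are the negatives.
    """
    if not anchor_seq or len(anchor_seq) != len(guide_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    num_frames = len(anchor_seq)
    total = 0.0
    for t in range(num_frames):
        lo = max(0, t - window)
        hi = min(num_frames, t + window + 1)
        logits = [
            cosine(anchor_seq[t], guide_seq[s]) / temperature
            for s in range(lo, hi)
        ]
        positive = logits[t - lo]
        # Numerically stable log-sum-exp over the local window.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - positive
    return total / num_frames


if __name__ == "__main__":
    frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print("aligned:   ", local_contrastive_loss(frames, frames))
    print("misaligned:", local_contrastive_loss(frames[::-1], frames))
```

In this toy setup the loss is close to zero when the two sequences agree frame by frame and grows when they are out of temporal order, which is the behaviour an alignment objective of this kind is meant to encourage. Restricting negatives to a local window is one possible reading of "local"; other definitions are equally plausible.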