UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Jinting Wang, Shan Yang, Chenxing Li, Dong Yu, Li Liu

TL;DR

UniCUE introduces the first unified framework for generating speech directly from Chinese Cued Speech (CS) videos. It transfers the visual understanding learned in CS Recognition (CSR) to diffusion-based CS Video-to-Speech generation (CSV2S). It integrates a pose-aware visual processor, a semantic alignment pool, and a VisioPhonetic adapter to produce cue-specific, temporally synchronized speech without relying on intermediate text. The model achieves state-of-the-art results on the UniCUE-HI corpus, improving Word Error Rate, synchronization metrics, and speech quality, while running inference faster than modular CSR+TTS pipelines. The UniCUE-HI dataset, which features both hearing-impaired and normal-hearing cuers, supports robust evaluation and downstream accessibility applications for hearing-impaired users.

Abstract

Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, generating speech directly from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11,282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.

Paper Structure

This paper contains 23 sections, 7 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Illustration of the rules of the Chinese CS system and the proposed framework (UniCUE). (a) The chart for Mandarin Chinese CS (figure from liu2019pilot), where five distinct hand positions are used to encode vowels, and eight finger shapes are employed to represent consonants in Mandarin Chinese. (b) Our framework enables the direct generation of synchronized natural speech from video.
  • Figure 2: (a) The combined CSV2S architecture, which chains separately trained CSR and TTS models. (b) Our unified framework (UniCUE), which transfers the understanding capabilities of CSR into speech generation training by integrating the visual processor of CSR into CSV2S.
  • Figure 3: Overview of our unified framework (UniCUE). It achieves direct Chinese CSV2S generation with semantic consistency, temporal alignment, and characteristic coherence by aligning the fine-grained spatiotemporal visual representations of CSR with the diffusion-based speech generator. The framework consists of three core modules: (1) Pose-Aware Visual Processor: integrates video and pose embeddings to perform fine-grained spatiotemporal modeling of lip and hand movements. (2) Semantic Alignment Pool: enhances the semantic mapping between visual features and speech content through video-text and pose-text contrastive learning. (3) VisioPhonetic Adapter (VPA): converts the fine-grained visual representations of CSR into condition encodings compatible with the diffusion-based generator.
  • Figure 4: The details of the VisioPhonetic Adapter, which transforms semantic visual embeddings into phonetic-aware features to enable seamless conditioning for diffusion-based speech synthesis.
  • Figure 5: User study results for accuracy, quality, and synchronization metrics on normal-hearing (a) and hearing-impaired (b) test samples.
  • ...and 7 more figures
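
The video-text and pose-text contrastive learning named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE objective over batches of paired embeddings, which is a common choice for contrastive alignment; the captions above do not state the paper's actual loss. The function names (`info_nce`, `alignment_loss`), the temperature of 0.07, and the equal weighting of the two terms are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `anchors` is paired with row i of `targets`.

    Assumption: this is a generic contrastive objective, not the paper's exact loss.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    n = len(a)
    # Temperature-scaled cosine similarities between every anchor and every target.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(matrix):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)  # subtract the max for numerical stability
            log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum_exp - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    # Average the anchor-to-target and target-to-anchor directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def alignment_loss(video, pose, text, temperature=0.07):
    """Hypothetical combined objective: video-text plus pose-text terms, equally weighted."""
    return info_nce(video, text, temperature) + info_nce(pose, text, temperature)


if __name__ == "__main__":
    # Toy 2-D embeddings for a batch of two clips (illustrative values only).
    text = [[1.0, 0.0], [0.0, 1.0]]
    video = [[0.9, 0.1], [0.1, 0.9]]
    pose = [[0.8, 0.2], [0.2, 0.8]]
    print("aligned loss:", alignment_loss(video, pose, text))
    print("shuffled-video loss:", alignment_loss(video[::-1], pose, text))
```

In this toy setup the loss is small when each clip's video and pose embeddings point toward their own text embedding, and it grows when the video rows are shuffled so that pairs no longer match. A real system would compute the same quantity on learned, batched tensors rather than Python lists.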