UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Jinting Wang; Shan Yang; Chenxing Li; Dong Yu; Li Liu

TL;DR

UniCUE introduces the first unified framework for directly generating speech from Chinese Cued Speech (CS) videos by transferring visual understanding from CS Recognition (CSR) to diffusion-based CS Video-to-Speech generation (CSV2S). It integrates a pose-aware visual processor, a semantic alignment pool, and a VisioPhonetic adapter to produce cue-specific, temporally synchronized speech without relying on intermediate text. The model achieves state-of-the-art results on the UniCUE-HI corpus, with notable improvements in Word Error Rate, synchronization metrics, and speech quality, while delivering faster inference than modular CSR+TTS pipelines. The UniCUE-HI dataset, featuring both hearing-impaired and normal-hearing cuers, supports robust evaluation and downstream accessibility applications for hearing-impaired users.
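
For readers unfamiliar with the Word Error Rate metric mentioned above, the following is a minimal sketch of the standard definition (edit distance between a reference and a hypothesis token sequence, divided by the reference length). It is not the paper's evaluation code. The function name `word_error_rate` is hypothetical, and how Mandarin transcripts are tokenized (characters versus words) is an assumption the summary does not specify, so the function simply accepts pre-tokenized sequences.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are sequences of tokens. The tokenization scheme is
    an assumption left to the caller; it is not taken from the paper.
    """
    ref = list(reference)
    hyp = list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and hyp[:j]; it starts as the distance from an empty reference.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref_tokens = "a b c d".split()
    hyp_tokens = "a x c".split()
    print(word_error_rate(ref_tokens, hyp_tokens))
```

Lower values are better, and the value can exceed 1.0 when the hypothesis contains many insertions.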

Abstract

Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. Conversely, directly generating speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11,282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.
