Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model

Wentao Lei, Li Liu, Jun Wang

TL;DR

GlossDiff tackles the generation of fine-grained Mandarin Chinese Cued Speech (CS) gestures from audio and text under limited data by introducing a gloss-based bridge between language and gesture and a rhythm-aware diffusion model. It combines a Knowledge Infusion Module that converts language into a CS gloss, a Gloss-Prompted Diffusion Generator conditioned on a gloss embedding through AdaIN, and an Audio-driven Rhythmic Module (ARM) that aligns gesture rhythm with speech using a WavLM-based rhythm encoder. A CLIP-based fine-tuning of MotionCLIP and a new rhythm metric, GAD, support robust, asynchronous lip-hand gesture synthesis, and the authors release MCCS, the first large-scale Mandarin Chinese Cued Speech dataset, recorded with four cuers. Quantitative and qualitative results on MCCS show state-of-the-art performance across key metrics (PCK, MAJE, MAD, GAD), supported by ablations and user studies. This work advances non-barrier communication by enabling precise, rhythm-aware CS gesture generation that tightly links language, audio, and gesture semantics.
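
As a rough illustration of what conditioning through AdaIN can look like, the sketch below normalizes a sequence of motion features per channel and then rescales and shifts them with parameters predicted from a conditioning vector. This is one common conditional variant of adaptive instance normalization, not the authors' implementation: the function name `adain`, the projection matrices `w_scale` and `w_shift`, and all tensor sizes are assumptions chosen for the example.

```python
import numpy as np

def adain(features, cond, w_scale, w_shift, eps=1e-5):
    """Adaptive instance normalization driven by a conditioning vector.

    features: (T, C) array of per-frame motion features.
    cond:     (D,) conditioning vector (here standing in for a gloss embedding).
    w_scale, w_shift: (D, C) hypothetical linear projections.
    """
    # Normalize every channel over the time axis.
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Predict a per-channel scale and shift from the condition.
    scale = 1.0 + cond @ w_scale
    shift = cond @ w_shift
    return normalized * scale + shift

# Toy sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
T, C, D = 16, 8, 4
features = rng.normal(size=(T, C))
gloss_embedding = rng.normal(size=D)
w_scale = rng.normal(scale=0.1, size=(D, C))
w_shift = rng.normal(scale=0.1, size=(D, C))

out = adain(features, gloss_embedding, w_scale, w_shift)
```

After the call, each channel of `out` has a mean equal to the predicted shift, so the condition fully controls the first-order statistics of the features while their temporal pattern is preserved.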

Abstract

Cued Speech (CS) is an advanced visual phonetic encoding system that integrates lip reading with hand coding, enabling people with hearing impairments to communicate efficiently. CS video generation aims to produce the specific lip and gesture movements of CS from audio or text inputs. The main challenge is that, given limited CS data, the model must simultaneously generate fine-grained hand and finger movements as well as lip movements, and the two kinds of movement must be asynchronously aligned. Existing CS generation methods are fragile and prone to poor performance because they rely on template-based statistical models and careful hand-crafted pre-processing to fit those models. Therefore, we propose a novel Gloss-prompted Diffusion-based CS Gesture generation framework (called GlossDiff). Specifically, to integrate additional knowledge of linguistic rules into the model, we first introduce a bridging instruction called Gloss, an automatically generated descriptive text that establishes a direct and more delicate semantic connection between spoken language and CS gestures. Moreover, we are the first to suggest that rhythm is an important paralinguistic feature for CS that improves communication efficacy. We therefore propose a novel Audio-driven Rhythmic Module (ARM) to learn rhythm that matches the audio speech. In addition, we design, record, and publish the first Chinese CS dataset, with four CS cuers. Extensive experiments demonstrate that our method quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. We release the code and data at https://glossdiff.github.io/.

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Details of the CS rules and the conversion process. (a) is the chart for Mandarin Chinese Cued Speech (figure from [3]), in which five hand positions code vowels and eight finger shapes code consonants. (b) shows the proposed instructional gloss, which directly links the text to the CS movements.
  • Figure 2: The overall framework of the proposed GlossDiff, where (a), (b), and (c) represent the Knowledge Infusion Module, the Audio-driven Rhythmic Module (ARM), and the diffusion-based generation module, respectively.
  • Figure 3: Visualization of gestures generated from the fine-grained Gloss. Best viewed zoomed in.
  • Figure 4: t-SNE visualization of the clusters for the eight groups of consonants, which correspond to finger shapes, and the five groups of vowels, which correspond to hand positions. Each color represents one group of consonants or vowels.
  • Figure 5: Visualization of the generated gestures compared with the SOTA method. Best viewed zoomed in.
  • ...and 1 more figure