Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model
Sanjana Sankar, Martin Lenglet, Gerard Bailly, Denis Beautemps, Thomas Hueber
TL;DR
This work tackles automatic Cued Speech generation (ACSG) by reprogramming a pre-trained audiovisual text-to-speech (AV-TTS) model, AVTacotron2, to predict hand and lip poses from text. To address the scarcity of French Cued Speech (CS) data, the authors record a new high-quality dataset (CSF23) and explore transfer-learning strategies. The approach reuses the pre-trained AV-TTS encoder, compares three training strategies, and evaluates the generated output with an automatic CS recognition system, reaching approximately 77% decoding accuracy at the phonetic level. Freezing the pre-trained encoder and fine-tuning the model on CS data gives the best performance, which indicates effective knowledge transfer from AV-TTS to ACSG and suggests potential for broader audiovisual cue generation. The study sets the stage for future avatar-based or photorealistic CS synthesis using generative models, advancing practical communication aids for the Deaf and hard-of-hearing communities.
Abstract
This paper presents a novel approach for automatic Cued Speech generation (ACSG). Cued Speech (CS) is a visual communication system used by people with hearing impairment to better perceive spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2). This model is reprogrammed to infer CS hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed using an automatic CS recognition system. With a decoding accuracy at the phonetic level reaching approximately 77%, the results demonstrate the effectiveness of our approach.
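
The abstract does not state how decoding accuracy at the phonetic level is computed. As a minimal sketch, assuming the common speech-recognition convention of one minus the edit distance between decoded and reference phone sequences, divided by the reference length, the metric could be computed as follows. The function names and the ASCII phone symbols are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # prev[j] holds the distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phone_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed metric: 1 - (substitutions + deletions + insertions) / len(ref)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical example: reference phones for a short French word and a
# decoded sequence with one substituted phone.
reference = ["b", "o~", "Z", "u", "R"]
decoded = ["b", "o~", "z", "u", "R"]
print(phone_accuracy(reference, decoded))
```

Here one substitution against five reference phones corresponds to an accuracy of 0.8. Under this convention insertions are penalized as well, so the score can fall below zero for very noisy output.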
