Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model

Sanjana Sankar, Martin Lenglet, Gerard Bailly, Denis Beautemps, Thomas Hueber

TL;DR

This work tackles automatic generation of Cued Speech by reprogramming an audiovisual text-to-speech model (AVTacotron2) to predict hand and lip poses from text, addressing data scarcity in French CS with a newly recorded high-quality dataset (CSF23) and exploring transfer-learning strategies. The approach leverages a pretrained AV-TTS encoder, tests three training strategies, and evaluates outputs with an automatic cued-speech recognition system, achieving up to ~77% phonetic accuracy. The results show that freezing the pretrained encoder and finetuning the model on CS data yields the best performance, demonstrating effective knowledge transfer from AV-TTS to ACSG and potential for broader audiovisual cue generation. The study sets the stage for future avatar-based or photorealistic CS synthesis using generative models, advancing practical communication aids for the Deaf and hard-of-hearing communities.

Abstract

This paper presents a novel approach for the automatic generation of Cued Speech (ACSG), a visual communication system used by people with hearing impairment to better elicit the spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2). This model is reprogrammed to infer Cued Speech (CS) hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed using an automatic CS recognition system. With a decoding accuracy at the phonetic level reaching approximately 77%, the results demonstrate the effectiveness of our approach.
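
The abstract reports a decoding accuracy at the phonetic level without stating how it is scored. As a minimal sketch, assuming the common edit-distance definition (accuracy = 1 - (substitutions + deletions + insertions) / reference length), the metric could be computed as below. The function names and the example phoneme strings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phoneme
                            curr[j - 1] + 1,      # insertion of an extra phoneme
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 1 - edit_distance / len(ref). Can be negative."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical example: reference phonemes vs. phonemes decoded by a recognizer.
reference = ["b", "o~", "Z", "u", "R"]
decoded = ["b", "o~", "S", "u"]  # one substitution, one deletion
score = phonetic_accuracy(reference, decoded)
```

Under this assumed definition, the example has two errors against five reference phonemes. In practice the errors would be summed over all test utterances before dividing by the total number of reference phonemes.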

Paper Structure (10 sections, 3 figures)

Figures (3)

  • Figure 1: Proposed two-step framework for the automatic generation of Cued Speech from text. The present work focuses on the part highlighted in green.
  • Figure 2: The ACSG architecture is a modified AVTacotron2 with an additional regression layer that generates both the mel-spectrogram and the CS articulators (hand and lips). The ACSR architecture is the pre-trained model from sankar22_icassp, used to evaluate the generated features.
  • Figure 3: Column (A) shows the original frame from the CS video. Columns (B) and (C) show the generated (in green) and expected (in red) hand and lip features, respectively.