MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production
Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng
TL;DR
MS2SL presents a diffusion-based, multimodal framework to generate continuous sign language sequences directly from spoken text or speech by leveraging a joint text–audio–sign embedding space with embedding-consistency learning (ECL). The method combines text and audio encoders (CLIP and HuBERT) with a sign-focused diffusion generator and a sign predictor, while ECL enables training when one or more modalities are missing. Empirical results on How2Sign and PHOENIX14T show state-of-the-art performance for both text-to-sign and audio-to-sign generation, with ablations demonstrating the value of multi-modality, cycle-consistency, and diffusion. The work advances accessible communication for sign-language users and contributes a scalable framework for cross-modal continuous sign production, though challenges remain for long sequences and nuanced sign articulation.
Abstract
Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directly from entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step. Moreover, by creating a joint embedding space for text, audio, and sign, we bind these modalities and leverage the semantic consistency among them to provide informative feedback for the model training. This embedding-consistency learning strategy minimizes the reliance on sign triplets and ensures continuous model refinement, even with a missing audio modality. Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
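To make the idea of consistency across a joint embedding space more concrete, the sketch below scores how well the text, audio, and sign embeddings of one sample agree, and skips any modality that is missing. It is only an illustration under stated assumptions: the cosine-based loss, the function names `cosine_similarity` and `consistency_loss`, and the toy two-dimensional vectors are all hypothetical and are not taken from the paper, whose actual objective may differ.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency_loss(embeddings: Dict[str, Optional[Vector]]) -> float:
    """Mean (1 - cosine similarity) over all pairs of present modalities.

    Hypothetical stand-in for an embedding-consistency term: a modality
    whose embedding is None (e.g. missing audio) is simply left out.
    """
    present = [v for v in embeddings.values() if v is not None]
    pairs = list(combinations(present, 2))
    if not pairs:
        # Fewer than two modalities: nothing to compare.
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings, assumed to already live in the same joint space.
full = {"text": [1.0, 0.0], "audio": [1.0, 0.0], "sign": [0.0, 1.0]}
no_audio = {"text": [1.0, 0.0], "audio": None, "sign": [0.0, 1.0]}

loss_full = consistency_loss(full)          # averages three modality pairs
loss_no_audio = consistency_loss(no_audio)  # only the text-sign pair remains
```

Because the loss is averaged over whichever pairs are available, a sample without audio still yields a training signal from the text-sign pair, which is the property the abstract attributes to embedding-consistency learning.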
