MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production

Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng

TL;DR

MS2SL is a diffusion-based, multimodal framework that generates continuous sign language sequences directly from spoken text or speech, using a joint text–audio–sign embedding space trained with embedding-consistency learning (ECL). The method combines text and audio encoders (CLIP and HuBERT) with a sign-focused diffusion generator and a sign predictor, while ECL enables training when one or more modalities are missing. Empirical results on How2Sign and PHOENIX14T show state-of-the-art performance for both text-to-sign and audio-to-sign generation, and ablations demonstrate the value of multi-modality, cycle-consistency, and diffusion. The work advances accessible communication for sign-language users and contributes a scalable framework for cross-modal continuous sign production, though challenges remain for long sequences and nuanced sign articulation.

Abstract

Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directly from entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step. Moreover, by creating a joint embedding space for text, audio, and sign, we bind these modalities and leverage the semantic consistency among them to provide informative feedback for the model training. This embedding-consistency learning strategy minimizes the reliance on sign triplets and ensures continuous model refinement, even with a missing audio modality. Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Illustration of our sign language producer. 1) We propose a unified, multimodal spoken data-driven framework for SLP that can directly produce sign sequences from spoken text or speech audio. 2) To overcome data scarcity, we train a joint embedding space through the spontaneous alignment of multimodal data. Within this space, we establish a consistency learning strategy to provide feedback signals that boost training.
  • Figure 2: Overview of our framework for MS2SL. It includes three key components: the sign predictor, modality binding, and the embedding-consistency learning (ECL) strategy. MS2SL directly unifies spoken content from different modalities into a common sign language production framework. The introduction of the joint embedding space and ECL reduces the reliance on co-occurring (text, audio, sign) triplets.
  • Figure 3: Example results. Left column: text-to-sign generation stream. Right column: audio-to-sign generation stream. Under the given conditions, our MS2SL can generate signs that are more semantically consistent with the spoken description and have more precise keypoints.
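
The captions above describe a joint embedding space in which text, audio, and sign embeddings are pulled toward one another, with training still possible when a modality is absent. The paper's actual ECL objective is not reproduced in this summary, so the following is only a minimal sketch under stated assumptions: the cosine-distance loss, the function names `cosine_similarity` and `consistency_loss`, and the plain-list embeddings are all hypothetical illustrations, not the authors' implementation.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_loss(
    sign: Optional[Vector] = None,
    text: Optional[Vector] = None,
    audio: Optional[Vector] = None,
) -> float:
    """Mean cosine distance over every pair of modalities that is present.

    Assumption (not from the paper): a missing modality is passed as None
    and simply contributes no pairs, so the remaining pairs still give a
    training signal.
    """
    available = [e for e in (sign, text, audio) if e is not None]
    pairs = [
        (a, b)
        for i, a in enumerate(available)
        for b in available[i + 1:]
    ]
    if not pairs:
        raise ValueError("at least two modalities are required")
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Full triplet: three pairwise terms are averaged.
full = consistency_loss(sign=[1.0, 0.0], text=[1.0, 0.0], audio=[0.0, 1.0])

# Audio missing: only the (sign, text) pair contributes.
no_audio = consistency_loss(sign=[1.0, 0.0], text=[1.0, 0.0])
```

In this toy setup, dropping the audio embedding changes only which pairs are averaged, which is one simple way to keep a consistency term defined on incomplete (text, audio, sign) triplets.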