Neural Sign Actors: A diffusion model for 3D sign language production from text

Vasileios Baltatzis, Rolandos Alexandros Potamias, Evangelos Ververas, Guanxiong Sun, Jiankang Deng, Stefanos Zafeiriou

TL;DR

This work tackles the challenge of producing realistic 3D sign language from text by introducing Neural Sign Actors, a diffusion-based SLP framework trained on a large-scale 4D signing dataset with SMPL-X avatars. It replaces intermediate gloss pipelines with a text-conditioned diffusion model that operates in a compact SMPL-X pose space, guided by an anatomically informed GNN, CLIP-based text embeddings, and an autoregressive decoder. A new 3D extension of the How2Sign dataset is created via a robust 4D reconstruction pipeline, enabling high-fidelity 3D signing and extensive evaluation, including a perceptual study with ASL users. The approach achieves state-of-the-art performance across reconstruction, articulation fidelity, and semantic back-translation, advancing realistic neural sign avatars and bridging communication for Deaf and Hard of Hearing communities.

Abstract

Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities. Deep learning methods for SL recognition and translation have achieved promising results. However, Sign Language Production (SLP) poses a challenge as the generated motions must be realistic and have precise semantic meaning. Most SLP methods rely on 2D data, which hinders their realism. In this work, a diffusion-based SLP model is trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.
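
As a rough illustration of the diffusion process mentioned in the abstract, the sketch below applies the standard DDPM forward (noising) step to a pose vector. It is not the authors' implementation: the linear noise schedule, the 1000 steps, the flattened 55-joint axis-angle layout, and the names `cumulative_alphas` and `forward_diffuse` are all assumptions chosen for illustration.

```python
import math
import random


def cumulative_alphas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return abar_t = prod_{s<=t} (1 - beta_s) for a linear beta schedule (assumed)."""
    abar, out = 1.0, []
    for s in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        abar *= 1.0 - beta
        out.append(abar)
    return out


def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I); return (x_t, noise)."""
    abar = alphas_cumprod[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
          for x, e in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)
alphas_cumprod = cumulative_alphas(1000)

# Hypothetical flattened pose for one frame: 55 joints x 3 axis-angle values.
x0 = [rng.uniform(-0.5, 0.5) for _ in range(55 * 3)]

# Noise the clean pose to an intermediate timestep.
xt, noise = forward_diffuse(x0, 500, alphas_cumprod, rng)
```

In a setup like this, a denoising network would be trained to predict `noise` from `xt`, the timestep, and a text embedding; the paper's denoiser $\epsilon_\Theta$ plays that role, with its architecture described in the method section rather than here.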

Paper Structure (14 sections, 9 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The proposed method takes raw text as input and generates a realistic and coherent motion of its corresponding sign language translation. From top to bottom: the input text, the ground truth sign language video (shown just for reference), and the generated motion.
  • Figure 2: Overview of the fitting pipeline. A set of input frames $F$ is first processed by OSX [lin2023one] to obtain an initial set of pose parameters $\mathbf{p}^{init}_{1:F}$. Then, using MediaPipe [lugaresi2019mediapipe], we fine-tune the predicted hand poses to match the detected joints $\mathbf{J}$ while constraining the hand poses $\boldsymbol{\theta}_h$ to lie in the space of plausible poses. Finally, using a temporal coherence loss, we acquire smooth and high-fidelity annotations of 3D signing avatars.
  • Figure 3: Overview of the proposed method. We employ a diffusion model to learn a mapping between text scripts and 3D sign language. The proposed framework consists of an auto-regressive denoising module $\epsilon_\Theta$ that is founded on the novel anatomically informed pose encoder to model the sign motions.
  • Figure 4: Qualitative comparison between the proposed and the baseline fitting frameworks on SGNify [SGNify] and How2Sign [duarte2021how2sign].
  • Figure 5: Qualitative comparison of signs generated from the text transcript by the proposed method and by the method of Stoll et al. [stoll2022there]. The ground truth video is given for reference.
  • ...and 2 more figures