Sign-IDD: Iconicity Disentangled Diffusion for Sign Language Production

Shengeng Tang, Jiayi He, Dan Guo, Yanyan Wei, Feng Li, Richang Hong

TL;DR

This paper tackles the problem of generating semantically consistent sign language poses from textual glosses. It introduces Sign-IDD, a diffusion-based framework that disentangles iconicity by converting 3D joints into a 4D bone representation and employs an Attribute Controllable Diffusion module conditioned on gloss semantics to constrain bone direction and length. The pose generation is further refined with joint and bone constraint losses, improving pose accuracy and skeletal coherence. Extensive experiments on PHOENIX14T and USTC-CSL demonstrate superior performance over state-of-the-art methods in back-translation metrics and pose quality, highlighting Sign-IDD's potential for more natural and linguistically faithful sign language production.

Abstract

Sign Language Production (SLP) aims to generate semantically consistent sign videos from textual statements, where the conversion from textual glosses to sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign poses as discrete three-dimensional coordinates and directly fit them, which overlooks the relative positional relationships among joints. To this end, we provide a new perspective, constraining joint associations and gesture details by modeling the limb bones to improve the accuracy and naturalness of the generated poses. In this work, we propose a pioneering iconicity disentangled diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD incorporates a novel Iconicity Disentanglement (ID) module to bridge the gap between relative positions among joints. The ID module disentangles the conventional 3D joint representation into a 4D bone representation, comprising the 3D spatial direction vector and 1D spatial distance vector between adjacent joints. Additionally, an Attribute Controllable Diffusion (ACD) module is introduced to further constrain joint associations, in which the attribute separation layer aims to separate the bone direction and length attributes, and the attribute control layer is designed to guide the pose generation by leveraging the above attributes. The ACD module utilizes the gloss embeddings as semantic conditions and finally generates sign poses from noise embeddings. Extensive experiments on PHOENIX14T and USTC-CSL datasets validate the effectiveness of our method. The code is available at: https://github.com/NaVi-start/Sign-IDD.
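
The 4D bone representation described above can be illustrated with a minimal sketch. It assumes that each bone is stored as a unit direction vector plus a Euclidean length, computed from a child joint and its parent. The function name `joints_to_bones`, the `parents` array layout, and the three-joint example skeleton are hypothetical and are not taken from the authors' code; whether Sign-IDD normalizes the direction vector exactly this way is also an assumption.

```python
import math

def joints_to_bones(joints, parents, eps=1e-8):
    """Convert 3D joints into 4D bones (3D direction + 1D length).

    joints:  list of (x, y, z) tuples, one per joint.
    parents: parents[i] is the parent index of joint i, or -1 for the root.
    Returns one (dx, dy, dz, length) tuple per non-root joint.
    """
    bones = []
    for child, parent in enumerate(parents):
        if parent < 0:  # the root joint has no incoming bone
            continue
        diff = [c - p for c, p in zip(joints[child], joints[parent])]
        length = math.sqrt(sum(d * d for d in diff))
        # Assumption: the direction is normalized to unit length; a
        # degenerate zero-length bone gets a zero direction vector.
        scale = 1.0 / length if length > eps else 0.0
        bones.append((diff[0] * scale, diff[1] * scale, diff[2] * scale, length))
    return bones

# Hypothetical chain: root joint -> joint 1 -> joint 2.
joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 2.0)]
parents = [-1, 0, 1]
bones = joints_to_bones(joints, parents)
```

Separating direction from length in this way means the two attributes can be constrained independently, which is the property the attribute separation and control layers rely on.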

Paper Structure

This paper contains 32 sections, 17 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) Problems caused by using only traditional 3D representations for SLP. (b) Traditional 3D joint coordinate representation [saunders2020progressive, tang2022gloss]. (c) Proposed 4D bone representation. We take the neck joint as the root joint and define the parent-child joints for each bone along the skeleton. The Euclidean distance and the direction vector between parent-child joints determine the length and orientation of each bone.
  • Figure 2: Overview of our framework. Given a gloss sequence, Sign-IDD generates a coherent sign pose video guided by gloss semantics. During training, we initially create Noise Pose $p_t$ by adding Gaussian noise to Target Pose $p_0$ for $t$ steps. Next, the 4D representation is derived from the 3D joint coordinates through Iconicity Disentanglement (ID). Then, the combination of the 3D and 4D representations is fed into the Attribute Controllable Diffusion (ACD) module, where gloss embeddings are integrated as a semantic condition. The attribute separation and control layers aim to separate skeletal attributes and guide pose generation. The final poses are optimized by applying joint and bone constraints. During inference, the initial $p_T$ is randomly sampled from Gaussian noise, with the disentanglement and reverse diffusion processes mirroring those used in training.
  • Figure 3: The main components of our ACD module.
  • Figure 4: Visualization examples of generated poses on PHOENIX14T (top) and USTC-CSL (bottom). We compare Sign-IDD with PT-GN and GEN-OBT. Gloss annotations, original video frames, and ground-truth poses are attached for clear evaluation.
  • Figure 5: Visualization results of Sign-IDD and Base, which is a diffusion-based baseline without ID and ACD modules.
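
The training-time noising step mentioned in the Figure 2 caption (creating the noisy pose $p_t$ from the target pose $p_0$) can be sketched as follows. This is a generic DDPM-style forward process, not the authors' implementation: the linear beta schedule, the 1000-step horizon, and the helper names `make_alpha_bars` and `add_noise` are all assumptions made for illustration. It uses the closed form $p_t = \sqrt{\bar\alpha_t}\,p_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ instead of adding noise step by step.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(p0, t, alpha_bars, rng):
    """Sample p_t from p_0 in one shot using the closed-form forward process."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in p0]

rng = random.Random(0)                 # fixed seed for reproducibility
alpha_bars = make_alpha_bars(1000)     # assumed number of diffusion steps
p0 = [0.1, -0.2, 0.3]                  # hypothetical flattened pose coordinates
p_early = add_noise(p0, 0, alpha_bars, rng)    # almost clean
p_late = add_noise(p0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

At inference the process runs in the opposite direction: $p_T$ is drawn directly from Gaussian noise and denoised step by step under the gloss condition.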