SignDiff: Diffusion Model for American Sign Language Production

Sen Fang; Chunyu Sui; Yanghao Zhou; Xuedong Zhang; Hongbin Zhong; Yapeng Tian; Chen Chen

TL;DR

This work presents SignDiff, a diffusion-based framework for American Sign Language production that translates spoken-language text into sign-language pose videos. A Frame Reinforcement Network (FR-Net) provides DensePose-style shape conditioning to a ControlNet-based diffusion model, improving frame-level fidelity and reducing finger-level distortions. The authors introduce text2pose and pose2video components with tensor- and loss-optimization strategies to handle the large-scale How2Sign data, achieving state-of-the-art results on PHOENIX14T and strong BLEU-4 performance on How2Sign, alongside notable SSIM gains. Ablation studies validate the importance of FR-Net and the proposed training objectives for efficient, high-quality sign language video generation. Overall, the approach advances end-to-end ASL production from text, with potential impact on learning tools and media generation.

Abstract

In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose. SignDiff has a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of multiple-finger artifacts in the diffusion model. In addition, we propose a new method for American Sign Language Production (ASLP) that generates ASL skeletal pose videos from text input. It integrates two improved modules and a new loss function to improve the accuracy and quality of the sign language skeletal poses and to enhance the model's ability to train on large-scale data. We propose the first baseline for ASL production and report BLEU-4 scores of 17.19 and 12.85 on the How2Sign dev and test sets, respectively. We also evaluate our model on the previous mainstream dataset PHOENIX14T, where it achieves state-of-the-art results. In addition, our image quality exceeds all previous results by 10 percentage points in terms of SSIM.

Paper Structure (17 sections, 5 equations, 7 figures, 6 tables)

INTRODUCTION
RELATED WORK
Sign Language Production
Rendering of Conditional Input
METHODOLOGY
Motivation and Design Idea
Dataset Processing
Preliminary
IMPLEMENTATION DETAILS
Tensor-Optimization-Based New Text2Pose Method
SignDiff for Efficient Pose2Video Production
EXPERIMENTS
Experimental Setup
Evaluation for Fast-SLP
Evaluation for SignDiff
...and 2 more sections

Figures (7)

Figure 1: Our Sign Language Production: First, input the English text you want to translate, and our new Text2Pose method will translate the text into continuous skeletal pose sequences. Then we plot the pose data of each frame into appropriate images, and users can further generate the final video using our new pose-conditioned human synthesis model (SignDiff).
Figure 2: (a) For our ASLP/Text2Pose approach, the training data comprise spoken text extracted from videos together with pose information ($x$: text, $p$: keypoints JSON file, $y$: original video frame). (b) The interaction between the training data and FR-Net. When using SignDiff, the original video frames are used to learn human body shape through FR-Net, which helps SignDiff render realistic human images from visualizations of the pose keypoints. $\bm{e}_{1}$ to $\bm{e}_{4}$ denote the outputs (i.e., the dual-condition intermediate representation, which is the superposition of skeleton and shape). The output of FR-Net can exhibit diverse styles, influenced by factors such as image resolution and fine details.
Figure 3: We condition Stable Diffusion (SD) via concatenation or via a more general cross-attention mechanism, which is now the main theoretical basis for controlling SD. $\bm{e}_{j}$ is the combined skeleton pose and predicted shape representation; how it is obtained is covered in detail in the section "SignDiff for Efficient Pose2Video Production".
Figure 4: $\bm{p}_{w}$, $\bm{e}_{j}$, and $\bm{y}_{t}$ denote deep features within the neural networks. The term "zero convolution" denotes a 1 × 1 convolution layer whose weight and bias parameters are initialized to zero. This diagram shows how a second, additional condition can be added to ControlNet.
Figure 5: Comparison of different settings on DTW values (the lower the better) at different training times.
...and 2 more figures
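
The "zero convolution" described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python toy, and the helper names (`conv1x1`, `zero_conv_params`, `add_maps`), the 2-channel 2 × 2 feature maps, and the simple additive fusion of the two condition branches are all assumptions made for illustration. It shows only the property the caption relies on: because weights and biases start at zero, each added condition branch contributes nothing at initialization, so the base features pass through unchanged until training moves the parameters away from zero.

```python
def conv1x1(feature_map, weight, bias):
    """Apply a 1x1 convolution.

    feature_map: nested list shaped [C_in][H][W]
    weight:      nested list shaped [C_out][C_in]
    bias:        list of length C_out
    A 1x1 convolution applies the same linear map to the channel
    vector at every pixel.
    """
    c_in = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][c] * feature_map[c][i][j] for c in range(c_in))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for o in range(len(weight))
    ]


def zero_conv_params(c_out, c_in):
    """Return (weight, bias) for a zero convolution: everything starts at 0."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


def add_maps(a, b):
    """Element-wise sum of two [C][H][W] feature maps."""
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


# Toy features (assumed values): base features plus two condition branches,
# standing in for a pose condition and a shape condition.
base = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
pose = [[[0.5, -1.0], [2.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
shape = [[[0.2, 0.4], [0.6, 0.8]], [[-1.0, 0.0], [1.0, 2.0]]]

w_pose, b_pose = zero_conv_params(2, 2)
w_shape, b_shape = zero_conv_params(2, 2)


def fuse():
    """Add both zero-convolved conditions onto the base features."""
    conditions = add_maps(
        conv1x1(pose, w_pose, b_pose), conv1x1(shape, w_shape, b_shape)
    )
    return add_maps(base, conditions)


# At initialization both branches output zeros, so the base is untouched.
fused_at_init = fuse()

# Simulate a hypothetical training update to a single pose-branch weight.
w_pose[0][0] = 0.1
fused_after_update = fuse()

print(fused_at_init == base)       # base features preserved at init
print(fused_after_update == base)  # the pose condition now has an effect
```

In a real model these layers would be learned convolution modules inside the network; the nested-list version is used here only so the example runs without any dependencies.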
