SignDiff: Diffusion Model for American Sign Language Production
Sen Fang, Chunyu Sui, Yanghao Zhou, Xuedong Zhang, Hongbin Zhong, Yapeng Tian, Chen Chen
TL;DR
This work presents SignDiff, a diffusion-based framework for American Sign Language production that translates spoken-language text into sign-language pose videos. A Frame Reinforcement Network (FR-Net) provides DensePose-style shape conditioning to a ControlNet-based diffusion model, improving frame-level fidelity and reducing finger-level distortions. The authors introduce text2pose and pose2video components with tensor- and loss-optimization strategies to handle the large-scale How2Sign data, achieving state-of-the-art results on PHOENIX14T and strong BLEU-4 performance on How2Sign, alongside notable SSIM gains. Ablation studies validate the importance of FR-Net and the proposed training objectives for efficient, high-quality sign language video generation. Overall, the approach advances end-to-end ASL production from text, with potential impact for learning tools and media generation.
Abstract
In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose. SignDiff includes a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of extra-finger artifacts in the diffusion model's output. In addition, we propose a new method for American Sign Language Production (ASLP) that generates ASL skeletal pose videos from text input. It integrates two new improved modules and a new loss function to improve the accuracy and quality of the sign language skeletal poses and to enhance the model's ability to train on large-scale data. We propose the first baseline for ASL production and report BLEU-4 scores of 17.19 and 12.85 on the How2Sign dev and test sets, respectively. We also evaluate our model on the previously mainstream dataset PHOENIX14T, where it achieves state-of-the-art results. In addition, our image quality exceeds all previous results by 10 percentage points in terms of SSIM.
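For readers unfamiliar with the image-quality metric cited above, the following is a minimal sketch of the structural similarity index (SSIM) between two grayscale images. It is an illustration only, not the evaluation code used for SignDiff. As a simplifying assumption, it computes the statistics over a single global window on flattened pixel lists, whereas standard SSIM averages the same expression over local windows. The function name `global_ssim` and its parameters are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of pixel intensities.

    Simplification (assumption): one global window instead of the usual
    average over local sliding windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n

    # Population variances and covariance of the two pixel sets.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Stabilizing constants with the conventional K1 = 0.01, K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [0, 64, 128, 255]
    generated = [255, 128, 64, 0]
    print(global_ssim(reference, reference))  # identical images score (approximately) 1
    print(global_ssim(reference, generated))  # dissimilar images score lower
```

Identical inputs yield a score of 1 (up to floating-point rounding), and the score drops as the luminance, contrast, or structure of the two images diverge.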
