Stable Signer: Hierarchical Sign Language Generative Model

Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

TL;DR

Stable Signer tackles error accumulation in Sign Language Production by proposing an end-to-end hierarchical framework that restricts processing to text understanding (Prompt2Gloss) and Pose2Vid, coordinated by SLUL and SLP-MoE. The model introduces Semantic-Aware Gloss Masking Loss (SAGM) to improve gloss robustness, and a gated mixture-of-experts (SLP-MoE) to produce stable, multi-style sign-language videos via a diffusion renderer. It demonstrates large gains over prior SLP methods, including a 48.6% improvement on BLEU-4/ROUGE metrics and substantial video-quality improvements across multiple benchmarks. The approach reduces pipeline error propagation and provides a practical path toward more reliable, high-fidelity ASL video generation.

Abstract

Sign Language Production (SLP) converts complex input text into a real human sign language video. Most previous work focused on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, and some concentrated on the Prompt2Gloss and Text2Avatar stages. However, the field has progressed slowly because text conversion, pose generation, and the rendering of poses into real human videos are each inaccurate, so errors accumulate from stage to stage. In this paper, we therefore streamline the traditional, redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines SLP as a hierarchical, end-to-end generation task that includes only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. Text understanding is performed by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by a hand gesture rendering expert block named SLP-MoE, which together produce high-quality, multi-style sign language videos end to end. SLUL is trained with the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Stable Signer improves performance by 48.6% over the current SOTA generation methods.

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Current SLP paths and their drawbacks. In SLP, a prompt or complex text is gradually transformed into a pose video, which is then converted into a real-person video. However, this process is so complex that it introduces many errors, which accumulate. We therefore aim to remove redundant intermediate steps (as the figure shows, the same "Gloss" can have poses that do not correspond in time), connect the initial and final steps more closely, and thereby achieve end-to-end hierarchical model learning.
  • Figure 2: Overview of the Prompt2Pose and Pose2Video pipeline: (a) Text/Prompt to Gloss: The SLUL module uses a T5 encoder to process complex prompts and generate precise gloss sequences, trained with SLUL Loss, SAGM Loss, KL divergence, and contrastive loss. (b) Gloss to Pose: A rule-based stable pose prior database provides candidate poses, which are selected by the Pose Chosen Networks (MLP) guided by semantic features. The Gating Network and Expert Networks in SLP-MoE determine the optimal pose selection with stability constraints applied. (c) Pose to Video: The stabilized pose sequence is fed into a diffusion renderer to generate high-quality sign language videos in multiple styles, with the semantic features reused for fine-tuning the MoE module.
  • Figure 3: Details of the SLP-MoE module: (a) Semantic states from SLUL generate query $q$ to produce gating weights $w_k$ over $K$ pose experts. Each expert retrieves poses from a rule-based prior database, yielding the weighted blended pose $\mathbf{p}_{\text{pose}}$. (b) The blended poses are refined across temporal frames using smoothing loss $\mathcal{L}_{\text{smooth}}$, velocity loss $\mathcal{L}_{\text{vel}}$, and hand fidelity loss $\mathcal{L}_{\text{hand}}$ to ensure temporal coherence and spatial accuracy in the final stabilized sequence $\hat{P}_t$.
  • Figure 4: Efficiency Study. We compare DTW scores at different training stages (one training session per 25% of an epoch); lower is better. Our SLUL and SLP-MoE modifications clearly improve the overall scores. Because SAGM Loss is only a loss computation with no direct relation to performance, its result is as expected. All the modifications therefore not only achieve their goals but also bring a significant improvement in efficiency.
  • Figure 5: Qualitative Results & User Study. We visualize our end-to-end sign language production pipeline. (a) Simplified pose targets automatically learned by the model, enabling robust end-to-end learning for SLP—an approach increasingly adopted in recent studies. (b) Intermediate pose video frames generated during the process. (c) Final sign language video frames. Ground truth frames are provided for comparison. Our method achieves high-fidelity image generation while accurately conveying sign language pose information.
  • ...and 2 more figures
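
The gating-and-blending step described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring between the query $q$ and per-expert keys, the helper names (`gate_weights`, `blend_poses`, `first_difference_penalty`, `second_difference_penalty`), and the toy vectors are all assumptions introduced here. The two penalties are generic jitter measures of the kind a velocity or smoothing loss might use; the paper's exact definitions of $\mathcal{L}_{\text{vel}}$ and $\mathcal{L}_{\text{smooth}}$ are not given in this section.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(query, expert_keys):
    """Gating weights w_k over K experts.

    Assumption: each expert has a key vector and is scored by a dot
    product with the query; the paper may score experts differently.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in expert_keys]
    return softmax(scores)


def blend_poses(weights, expert_poses):
    """Weighted blend p_pose = sum_k w_k * pose_k, per coordinate."""
    dim = len(expert_poses[0])
    return [
        sum(w * pose[d] for w, pose in zip(weights, expert_poses))
        for d in range(dim)
    ]


def first_difference_penalty(sequence):
    """Mean squared frame-to-frame change (a generic velocity-style term)."""
    terms = [
        (b[d] - a[d]) ** 2
        for a, b in zip(sequence, sequence[1:])
        for d in range(len(a))
    ]
    return sum(terms) / len(terms)


def second_difference_penalty(sequence):
    """Mean squared second difference (a generic smoothness-style term)."""
    terms = [
        (c[d] - 2.0 * b[d] + a[d]) ** 2
        for a, b, c in zip(sequence, sequence[1:], sequence[2:])
        for d in range(len(a))
    ]
    return sum(terms) / len(terms)


# Toy example with K = 2 experts and 2-D "poses" (hypothetical values).
query = [1.0, 0.0]
expert_keys = [[2.0, 0.0], [0.0, 2.0]]
expert_poses = [[0.0, 0.0], [1.0, 1.0]]

weights = gate_weights(query, expert_keys)
blended = blend_poses(weights, expert_poses)

# A constant-velocity track has zero second difference; a zig-zag does not.
steady_track = [[0.0], [1.0], [2.0]]
jittery_track = [[0.0], [1.0], [0.0]]
```

Because the weights come from a softmax, they are positive and sum to one, so the blended pose is a convex combination of the expert poses. The second-difference penalty is zero for the constant-velocity track and positive for the zig-zag track, which is the behavior a temporal smoothing term is meant to reward.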