SignLLM: Sign Language Production Large Language Models

Sen Fang; Chen Chen; Lei Wang; Ce Zheng; Chunyu Sui; Yapeng Tian

TL;DR

SignLLM introduces a multilingual Sign Language Production model with two specialized modes, MLSF and Prompt2LangGloss, designed to translate text or prompts into sign-language poses across eight languages. It is trained on Prompt2Sign, a standardized, OpenPose-derived pose dataset, and employs a novel RL-based loss (RL Loss) with a Priority Learning Channel to accelerate training and prioritize valuable data. The paper demonstrates state-of-the-art performance across eight sign languages via extensive quantitative, ablation, and qualitative evaluations, and shows substantial training-time reductions (e.g., 27.1% in one ablation) when using RL-based prioritization. The work also provides a detailed supplementary suite covering data processing, dataset details, and extensibility studies, aiming to establish Prompt2Sign as a scalable foundation for multilingual SLP. Overall, SignLLM advances multilingual SLP by combining a design that preserves translation efficiency with LLM-based interaction capabilities, targeted at practical, scalable deployment in assistive and educational contexts.
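The two modes can be pictured with a small sketch. This is not the authors' implementation: `make_stub_model`, `MLSFRouter`, and `tag_gloss` are hypothetical names, the per-language models are stubs, the `<LANG>` marker format is an assumption, and the language codes are only examples. It shows MLSF as one encoder-decoder group per language selected by a language key, and Prompt2LangGloss as a language attribute attached to each gloss token.

```python
from typing import Callable, Dict, List

Pose = List[float]  # placeholder for one frame of pose keypoints


def make_stub_model(lang: str) -> Callable[[str], List[Pose]]:
    """Hypothetical stand-in for one per-language Text2Pose encoder-decoder."""
    def model(text: str) -> List[Pose]:
        # One dummy frame per word; a real model would predict keypoints.
        return [[float(len(word))] for word in text.split()]
    return model


class MLSFRouter:
    """Routes text to the encoder-decoder group of the requested language."""

    def __init__(self, languages: List[str]) -> None:
        # One independent group per language (Text2Pose x number of languages).
        self.groups: Dict[str, Callable[[str], List[Pose]]] = {
            lang: make_stub_model(lang) for lang in languages
        }

    def produce(self, text: str, lang: str) -> List[Pose]:
        if lang not in self.groups:
            raise KeyError(f"no encoder-decoder group for language {lang!r}")
        return self.groups[lang](text)


def tag_gloss(gloss: List[str], lang: str) -> List[str]:
    """Attach a language marker to every gloss token (marker format assumed)."""
    return [f"<{lang}>{token}" for token in gloss]


router = MLSFRouter(["ASL", "DGS"])  # example language codes only
frames = router.produce("hello world", "ASL")
tagged = tag_gloss(["HELLO", "WORLD"], "ASL")
```

In this sketch the MLSF path keeps languages fully separate, while the tagged-gloss path lets a single shared model see which language each gloss belongs to.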

Abstract

In this paper, we propose SignLLM, a multilingual Sign Language Production (SLP) large language model. It includes two novel multilingual SLP modes, MLSF and Prompt2LangGloss, which generate sign language gestures from query-text input and question-style prompt input, respectively. Both modes can use a new loss based on reinforcement learning (RL) and a new RL module named the Priority Learning Channel. These RL components can accelerate training by enhancing the model's capability to sample high-quality data. To train SignLLM, we introduce Prompt2Sign, a comprehensive multilingual sign language dataset built from public data, covering American Sign Language (ASL) and seven other sign languages. The dataset standardizes information by extracting pose information from sign language videos into a unified compressed format. We extensively evaluate SignLLM and demonstrate that it achieves state-of-the-art performance on SLP tasks across eight sign languages.
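The idea of sampling high-quality data more often can be illustrated with a generic priority-weighted sampler. This is a minimal sketch of the general technique, not the paper's Priority Learning Channel: the functions `priority_sample` and `update_priority`, the moving-average update, and the `rate` parameter are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def priority_sample(items: Sequence[str], priorities: Sequence[float],
                    k: int, rng: random.Random) -> List[str]:
    """Draw k items with replacement, with probability proportional to priority."""
    if len(items) != len(priorities):
        raise ValueError("items and priorities must have the same length")
    if any(p < 0 for p in priorities):
        raise ValueError("priorities must be non-negative")
    return rng.choices(items, weights=priorities, k=k)


def update_priority(old: float, reward: float, rate: float = 0.5) -> float:
    """Move a sample's priority toward its latest reward (assumed update rule)."""
    return (1.0 - rate) * old + rate * reward


rng = random.Random(0)
# The second sample carries all the priority mass, so it is always drawn.
batch = priority_sample(["low_value", "high_value"], [0.0, 1.0], k=5, rng=rng)
new_priority = update_priority(old=1.0, reward=0.0, rate=0.5)
```

Samples whose reward stays high keep a high priority and are revisited more often, which is one plausible way prioritization could shorten training.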

Paper Structure (58 sections, 4 equations, 7 figures, 12 tables)

This paper contains 58 sections, 4 equations, 7 figures, 12 tables.

Introduction
Related Work
Our Benchmark: Prompt2Sign
Our Model: SignLLM
Preliminary of Text2Pose Method
Design Overview
Two Multilingual SLP Modes
Reinforcement Learning Training Strategy
Experiments and Discussions
Quantitative Evaluation
Ablation Evaluation
Qualitative Evaluation
Discussion
Conclusion
Overview of Supplementary Materials
...and 43 more sections

Figures (7)

Figure 1: Overview: (Left) Major components of our Prompt2Sign dataset. Compressed Pose is reprocessed pose data suitable for training; we use public sign language videos to produce compressed pose data in our predefined format. (Right) Our proposed SignLLM aims to generate sign language poses for digital human or avatar generation [chen2023executing, cai2023smplerx, zwitserlood2004synthetic, zhang2023adding].
Figure 2: (Left) MLSF contains parallel Enc-Dec groups (i.e., Text2Pose $\times$ number of languages), while Prompt2LangGloss adds a language attribute marker at the gloss channel (i.e., Text2Gloss2Pose $\rightarrow$ Prompt2LangGloss2Pose). (Right) The output of SignLLM can be converted into a skeletal pose video, which can then be rendered into a realistic human appearance by vid2vid models [NEURIPS2022_ec795aea, chan2019everybody, wei2020gac, zhou2019dance, zhang2023adding].
Figure 3: RL elements (User, Agent, Environment, Cyclic Sampling, and the Priority Learning Channel, PLC) used to sketch the sequence prediction learning process.
Figure 4: RL Training Efficiency Analysis: Comparison of different settings on DTW values (lower is better) at different periods (one period is 30 epochs). Left Y-axis: DTW value. Right Y-axis: loss value. Prompt: Prompt2LangGloss mode.
Figure 5: We use an adjusted vid2vid model [feng2023dreamoving] to convert the predicted skeletal pose video into a more realistic final video.
...and 2 more figures
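
Figure 4 reports DTW (dynamic time warping) values, where lower means the predicted pose sequence aligns more closely with the reference. The sketch below shows the textbook DTW recurrence on two pose sequences. It is an illustration under assumptions: frames are flat lists of coordinates, the per-frame cost is Euclidean distance, and no normalization is applied, so the numbers are not comparable to the paper's reported scores.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # one pose frame as a flat list of coordinates


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: Sequence[Frame], seq_b: Sequence[Frame]) -> float:
    """Classic O(n*m) dynamic time warping cost between two sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] is the best alignment cost of the first i and j frames.
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame in seq_b
                                 cost[i][j - 1],      # skip a frame in seq_a
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]


reference = [[0.0], [1.0], [2.0]]
slower = [[0.0], [1.0], [1.0], [2.0]]  # same motion with one repeated frame
same_motion_cost = dtw_distance(reference, slower)
shifted_cost = dtw_distance([[0.0]], [[3.0]])
```

Because DTW allows frames to be repeated during alignment, the slower copy of the same motion costs nothing, which is why the measure suits sign sequences produced at different speeds.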
