Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

Rui Hong; Jana Kosecka

Abstract

Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text input remains challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as handshape, hand location, and movement. We first establish a strong diffusion baseline using a diffusion model in the style of the Human Motion Diffusion Model (MDM) with an SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning by varying the text encoder (CLIP vs. T5), the conditioning mode (gloss-only vs. gloss+phonological attributes), and the attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches in which gloss and phonological attributes are encoded through independent pathways.
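To make the conditioning variants concrete, the following minimal Python sketch shows how the three text inputs (gloss-only, gloss plus symbolic attributes, and gloss plus attributes mapped to natural language) might be assembled before being passed to a text encoder. The function `build_prompt`, the `SYMBOL_TO_TEXT` table, the attribute names, and the separator format are all hypothetical illustrations, not the mapping or prompt template used in this work.

```python
from typing import Dict, Optional

# Hypothetical lookup from symbolic attribute codes to plain-English phrases.
# These entries are illustrative placeholders, not the paper's actual mapping.
SYMBOL_TO_TEXT: Dict[str, str] = {
    "5": "open hand with spread fingers",
    "Neutral": "neutral space in front of the body",
    "BackAndForth": "back and forth movement",
}


def build_prompt(
    gloss: str,
    attributes: Optional[Dict[str, str]] = None,
    map_to_natural_language: bool = False,
) -> str:
    """Build the text that would be fed to a frozen text encoder.

    - attributes is None/empty      -> gloss-only conditioning
    - attributes given, no mapping  -> gloss + symbolic attributes
    - attributes given, with mapping -> gloss + natural-language attributes
    """
    if not attributes:
        return gloss

    parts = []
    for name, value in attributes.items():
        if map_to_natural_language:
            # Fall back to the raw symbol when no phrase is known.
            value = SYMBOL_TO_TEXT.get(value, value)
        parts.append(f"{name}: {value}")

    # Assumed format: gloss first, then attributes separated by semicolons.
    return f"{gloss}; " + "; ".join(parts)


# Hypothetical attribute values for a single sign.
example_attributes = {
    "handshape": "5",
    "location": "Neutral",
    "movement": "BackAndForth",
}

gloss_only = build_prompt("cool")
symbolic = build_prompt("cool", example_attributes)
mapped = build_prompt("cool", example_attributes, map_to_natural_language=True)
```

Under these assumptions, the only difference between the symbolic and mapped variants is the wording of the attribute values, which is the factor the analysis isolates when comparing CLIP and T5.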

Paper Structure (35 sections, 3 equations, 2 figures, 3 tables)


Introduction
Related Work
Sign Language Generation.
Human Motion Diffusion Models.
Sign Language Phonology and ASL-LEX.
Method
Motion Representation
Diffusion Framework
Training objective.
Conditioning Mechanism
Gloss-only conditioning.
Gloss + attribute conditioning.
Text encoder variants.
Experiments
ASL3DWord.
...and 20 more sections

Figures (2)

Figure 1: Model architecture overview. Three branches prepare input tokens in parallel: (left) the gloss label, optionally concatenated with phonological attributes from ASL-LEX 2.0, is encoded by a frozen text encoder (CLIP or T5) and projected by a 3-layer MLP to form the condition token $\mathbf{e}_c$; (center) the diffusion timestep is embedded via sinusoidal encoding and a 2-layer MLP to form $\mathbf{e}_t$; (right) the noisy motion $\mathbf{x}_t \in \mathbb{R}^{T\times D}$ is linearly projected with learned positional encoding to form per-frame tokens $\mathbf{m}_1,\ldots,\mathbf{m}_T$. The concatenated sequence is processed by a 4-layer Transformer encoder, and the output motion tokens are decoded by a per-frame MLP to predict the clean motion $\hat{\mathbf{x}}_0$.
Figure 2: Qualitative comparison for the gloss "cool". We show evenly sampled keyframes. Top row: ground-truth motion from ASL3DWord. Middle row: SignAvatar [dong2024signavatar] generation. Bottom row: our CLIP gloss-only diffusion model.
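
The token layout described in the Figure 1 caption can be sketched in plain Python as follows. This is only an illustration of the bookkeeping: the helper names (`sinusoidal_embedding`, `assemble_sequence`, `motion_outputs`), the base frequency of 10000, and the choice to place $\mathbf{e}_c$ and $\mathbf{e}_t$ before the motion tokens are assumptions, since the caption does not fix the ordering. The MLPs and the Transformer encoder are omitted.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sinusoidal_embedding(t: int, dim: int) -> Vector:
    """Standard sinusoidal encoding of an integer diffusion timestep.

    The first half of the vector holds sines, the second half cosines.
    The base frequency 10000 is the common convention, assumed here.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def assemble_sequence(
    e_c: Vector, e_t: Vector, motion_tokens: Sequence[Vector]
) -> List[Vector]:
    """Concatenate [e_c, e_t, m_1, ..., m_T] into one sequence of length T + 2.

    Assumed ordering: condition token, then timestep token, then frames.
    """
    return [e_c, e_t] + list(motion_tokens)


def motion_outputs(encoder_outputs: Sequence[Vector]) -> List[Vector]:
    """Keep only the T outputs aligned with the motion tokens.

    These are the positions a per-frame decoder would map to the
    predicted clean motion.
    """
    return list(encoder_outputs[2:])


# Toy dimensions, chosen only for illustration.
T, d_model = 5, 8
e_c = [0.0] * d_model                      # stand-in for the projected text embedding
e_t = sinusoidal_embedding(t=0, dim=d_model)
frames = [[float(i)] * d_model for i in range(T)]  # stand-in for projected noisy frames

sequence = assemble_sequence(e_c, e_t, frames)
decoded_positions = motion_outputs(sequence)
```

With this layout the encoder sees T + 2 tokens, and only the last T positions are decoded back into motion frames.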
