Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

Rui Hong, Jana Kosecka

Abstract

Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text input remains very challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as handshape, hand location, and movement. We first establish a strong diffusion baseline using a diffusion model in the style of the Human Motion Diffusion Model (MDM) with the SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning by varying the text encoder (CLIP vs. T5), the conditioning mode (gloss-only vs. gloss + phonological attributes), and the attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations into natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches in which gloss and phonological attributes are encoded through independent pathways.
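
To make the compared conditioning inputs concrete, the following is a minimal sketch of how the three text variants (gloss-only, gloss + symbolic attributes, gloss + attributes mapped to natural language) might be assembled before being passed to a text encoder. The function `build_prompt`, the prompt template, the attribute codes, and the lookup table `HYPOTHETICAL_MAP` are all illustrative assumptions; they are not taken from the paper or from ASL-LEX 2.0.

```python
from typing import Dict, Mapping, Optional

# Hypothetical symbolic-to-natural-language lookup. The codes and phrases are
# made up for illustration; they are NOT the ASL-LEX 2.0 inventory or the
# authors' mapping.
HYPOTHETICAL_MAP: Dict[str, Dict[str, str]] = {
    "handshape": {"open_b": "flat open hand"},
    "location": {"neutral": "neutral space in front of the body"},
    "movement": {"back_and_forth": "back and forth"},
}


def build_prompt(
    gloss: str,
    attributes: Optional[Mapping[str, str]] = None,
    mapping: Optional[Mapping[str, Mapping[str, str]]] = None,
) -> str:
    """Return the text string that would be fed to the text encoder.

    - attributes is None/empty  -> gloss-only conditioning
    - attributes, mapping=None  -> gloss + symbolic attribute codes
    - attributes and mapping    -> gloss + natural-language attributes
    """
    if not attributes:
        return gloss
    parts = []
    for name, code in attributes.items():
        # Fall back to the raw code when no natural-language phrase is known.
        value = mapping.get(name, {}).get(code, code) if mapping else code
        parts.append(f"{name}: {value}")
    return f"{gloss}, " + ", ".join(parts)


# Placeholder gloss and attribute values (assumed, not a real lexicon entry).
attrs = {"handshape": "open_b", "location": "neutral", "movement": "back_and_forth"}
print(build_prompt("example"))
print(build_prompt("example", attrs))
print(build_prompt("example", attrs, HYPOTHETICAL_MAP))
```

The only difference between the second and third variants is the surface form of the attribute values, which is the factor the abstract identifies as decisive for CLIP and largely irrelevant for T5.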

Paper Structure

This paper contains 35 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Model architecture overview. Three branches prepare input tokens in parallel: (left) the gloss label (optionally concatenated with phonological attributes from ASL-LEX 2.0) is encoded by a frozen text encoder (CLIP or T5) and projected by a 3-layer MLP to form the condition token $\mathbf{e}_c$; (center) the diffusion timestep is embedded via sinusoidal encoding and a 2-layer MLP to form $\mathbf{e}_t$; (right) the noisy motion $\mathbf{x}_t \in \mathbb{R}^{T\times D}$ is linearly projected with learned positional encoding to form per-frame tokens $\mathbf{m}_1,\ldots,\mathbf{m}_T$. The concatenated sequence is processed by a 4-layer Transformer Encoder, and the output motion tokens are decoded by a per-frame MLP to predict the clean motion $\hat{\mathbf{x}}_0$.
  • Figure 2: Qualitative comparison for the gloss "cool". We show evenly sampled keyframes. Top row: ground-truth motion from ASL3DWord. Middle row: generation by SignAvatar (Dong et al., 2024). Bottom row: our CLIP gloss-only diffusion model.
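
The denoiser described in the Figure 1 caption can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `MotionDenoiser`, the hidden width, number of attention heads, activation functions, feed-forward width, depth of the output MLP, and the order in which $\mathbf{e}_c$, $\mathbf{e}_t$, and the motion tokens are concatenated are all assumed. The frozen text encoder is omitted; `text_feat` stands in for its pooled output.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps of shape (B,) to features of shape (B, dim); dim must be even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)            # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)   # (B, dim)


class MotionDenoiser(nn.Module):
    """Predicts the clean motion x0_hat from noisy motion x_t, timestep t, and a text feature."""

    def __init__(self, motion_dim: int, text_dim: int, d_model: int = 256,
                 n_heads: int = 4, max_frames: int = 128):
        super().__init__()
        self.d_model = d_model
        # Left branch: 3-layer MLP projecting the (frozen) text-encoder feature to e_c.
        self.cond_mlp = nn.Sequential(
            nn.Linear(text_dim, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model),
        )
        # Center branch: 2-layer MLP on the sinusoidal timestep encoding, giving e_t.
        self.time_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model),
        )
        # Right branch: linear projection of each frame plus a learned positional encoding.
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.pos_emb = nn.Parameter(torch.randn(1, max_frames, d_model) * 0.02)
        # 4-layer Transformer encoder over [e_c, e_t, m_1..m_T] (token order is an assumption).
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Per-frame MLP decoding motion tokens back to the motion dimension D.
        self.motion_out = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, motion_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        num_frames = x_t.shape[1]
        e_c = self.cond_mlp(text_feat).unsqueeze(1)                               # (B, 1, d)
        e_t = self.time_mlp(sinusoidal_embedding(t, self.d_model)).unsqueeze(1)   # (B, 1, d)
        m = self.motion_in(x_t) + self.pos_emb[:, :num_frames]                    # (B, T, d)
        h = self.encoder(torch.cat([e_c, e_t, m], dim=1))                         # (B, T+2, d)
        return self.motion_out(h[:, 2:])                                          # (B, T, D)


# Toy sizes chosen for illustration only.
model = MotionDenoiser(motion_dim=12, text_dim=16, d_model=32, n_heads=4, max_frames=20)
x_t = torch.randn(2, 10, 12)            # batch of 2 clips, T=10 frames, D=12 features
t = torch.randint(0, 1000, (2,))        # one diffusion timestep per clip
text_feat = torch.randn(2, 16)          # stand-in for the frozen text encoder output
x0_hat = model(x_t, t, text_feat)
print(x0_hat.shape)
```

The output has the same shape as `x_t`, since only the motion tokens (not the condition and timestep tokens) are decoded. Predicting $\hat{\mathbf{x}}_0$ directly, rather than the added noise, follows the caption's description of the output.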