Ham2Pose: Animating Sign Language Notation into Pose Sequences

Rotem Shalev-Arkushin, Amit Moryossef, Ohad Fried

TL;DR

This work tackles Sign Language Production by translating HamNoSys lexical notation into signed pose sequences using a two-part Transformer framework that jointly processes HamNoSys text and a reference pose. The pose sequence is generated gradually over a fixed number of steps, guided by a diffusion-like refinement schedule and learned through a weighted, confidence-aware loss that handles missing keypoints. A novel evaluation metric, nDTW-MJE, accounts for incomplete data and normalizes keypoints to robustly compare pose trajectories; validation on AUTSL shows that it compares pose sequences more accurately than existing metrics. The approach generalizes across multiple sign languages, and the authors release code and data-processing tools to foster further research toward end-to-end Sign Language Production systems.

Abstract

Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating a text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement for pose sequences, normalized Dynamic Time Warping (nDTW), based on DTW over normalized keypoints trajectories, and validate its correctness using AUTSL, a large-scale Sign language dataset. We show that it measures the distance between pose sequences more accurately than existing measurements and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research.
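The distance measurement described in the abstract (DTW over normalized keypoint trajectories) can be illustrated with a minimal NumPy sketch. This is not the authors' released implementation. The shoulder-based normalization, the keypoint indices, the use of NaN to mark missing keypoints, and the choice to skip keypoints missing in either frame are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

def normalize_pose(seq, left_shoulder=0, right_shoulder=1):
    """Center each frame on the shoulder midpoint and scale the sequence
    so that its mean shoulder width is 1 (assumed normalization)."""
    seq = np.asarray(seq, dtype=float)            # shape: (frames, keypoints, dims)
    left = seq[:, left_shoulder]
    right = seq[:, right_shoulder]
    midpoint = (left + right) / 2.0
    width = np.nanmean(np.linalg.norm(left - right, axis=-1))
    return (seq - midpoint[:, None, :]) / width

def frame_distance(a, b):
    """Mean joint error over keypoints present in both frames.
    Missing keypoints are encoded as NaN and skipped (an assumption)."""
    valid = ~(np.isnan(a).any(axis=-1) | np.isnan(b).any(axis=-1))
    if not valid.any():
        return 0.0
    return float(np.linalg.norm(a[valid] - b[valid], axis=-1).mean())

def dtw_mje(x, y):
    """Classic dynamic-programming DTW with frame_distance as local cost."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def ndtw_mje(x, y):
    """DTW-MJE computed on normalized pose sequences."""
    return dtw_mje(normalize_pose(x), normalize_pose(y))
```

Under these assumptions, two sequences that differ only by a global scale, a translation, or a change in signing speed should score close to zero, while a displaced joint yields a positive distance.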

Paper Structure (40 sections, 8 equations, 12 figures, 8 tables, 1 algorithm)

Figures (12)

  • Figure 1: German Sign Language sign for "Haus". Gloss is a unique semantic identifier; HamNoSys and SignWriting describe the phonology of a sign: Two flat hands with fingers closed, rotated towards each other, touching, then symmetrically moving diagonally downwards.
  • Figure 2: JASign (SigML) failure cases: hand-inside-hand (H-in-H) and hand-inside-clothes (H-in-C) artifacts, and wrong signing.
  • Figure 3: Result examples. Top row: original video frames; middle row: ground truth pose detected by OpenPose; bottom row: generated pose. Despite missing keypoints in the ground truth pose, our model generates a correct pose.
  • Figure 4: Model architecture. First, the text processor encodes the HamNoSys and predicts the sequence length. Next, the reference pose is duplicated to the sequence length and passed to the pose generator, which iteratively uses the current pose sequence and HamNoSys encoding for T steps and generates the desired pose. After T steps, the pose generator outputs the final pose sequence.
  • Figure 5: Blend importance example. Left to right: original image, original pose, pose generated by addition, by replacement, and by blend.
  • ...and 7 more figures
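
The generation loop described in the Figure 4 caption, together with the blend update mentioned in Figure 5, can be sketched as follows. This is only an illustration under stated assumptions: `refine_fn` is a hypothetical stand-in for the pose generator (which in the paper also consumes the HamNoSys encoding), and the linear step-size schedule `alpha = t / num_steps` is assumed here, not taken from the paper.

```python
import numpy as np

def generate_pose_sequence(reference_pose, seq_len, refine_fn, num_steps=10):
    """Iteratively refine a pose sequence, starting from a duplicated reference pose.

    reference_pose: array of shape (keypoints, dims)
    seq_len:        predicted sequence length (from the text processor in the paper)
    refine_fn:      hypothetical callable (sequence, step) -> predicted sequence
    """
    # Duplicate the reference pose along the time axis.
    seq = np.repeat(np.asarray(reference_pose, dtype=float)[None], seq_len, axis=0)
    for t in range(1, num_steps + 1):
        predicted = refine_fn(seq, t)
        # Blend the prediction with the current sequence instead of adding
        # or replacing; the schedule below is an assumption for illustration.
        alpha = t / num_steps
        seq = alpha * predicted + (1.0 - alpha) * seq
    return seq
```

With this assumed schedule the final step uses `alpha = 1`, so the output equals the last prediction, while earlier steps move only part of the way toward it.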