A Data-Driven Representation for Sign Language Production

Harry Walsh, Abolfazl Ravanshad, Mariam Rahmani, Richard Bowden

TL;DR

This work introduces a data-driven, discrete representation for sign language production by learning a pose codebook from 3D sign data using NSVQ and translating spoken language to a sequence of codebook tokens with a Transformer. Each token maps to a short sequence of poses, and a sign-stitching module ensures smooth, continuous signing, reducing reliance on expensive gloss annotations. The approach achieves state-of-the-art back-translation performance on PHOENIX14T and mDGS, with substantial BLEU-1 gains and strong DTW-based pose accuracy, and demonstrates cross-dataset codebook transfer. Overall, the method provides a scalable, annotation-light path to high-quality sign language production and enables sharing of codebooks across datasets.

Abstract

Phonetic representations are used when recording spoken languages, but no equivalent exists for recording signed languages. As a result, linguists have proposed several annotation systems that operate on the gloss or sub-unit level; however, these resources are notably irregular and scarce. Sign Language Production (SLP) aims to automatically translate spoken language sentences into continuous sequences of sign language. However, current state-of-the-art approaches rely on these scarce linguistic resources, which has limited progress in the field. This paper transforms the continuous pose generation problem into a discrete sequence generation problem, thereby overcoming the need for costly annotation; where annotation is available, we leverage the additional information to enhance our approach. By applying Vector Quantisation (VQ) to sign language data, we first learn a codebook of short motions that can be combined to create a natural sequence of sign, where each token in the codebook can be thought of as an entry in the lexicon of our representation. Then, using a transformer, we perform a translation from spoken language text to a sequence of codebook tokens. Each token can be directly mapped to a sequence of poses, allowing the translation to be performed by a single network. Furthermore, we present a sign stitching method to effectively join tokens together. We evaluate on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the more challenging Meine DGS Annotated (mDGS) datasets. An extensive evaluation shows our approach outperforms previous methods, increasing the BLEU-1 back translation score by up to 72%.
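
To make the discrete representation concrete, the following is a minimal sketch of the two lookups the abstract describes: assigning a continuous vector to its nearest codebook entry, and expanding a token sequence back into poses. It is an illustration under stated assumptions, not the paper's implementation: the toy codebook, the `token_poses` table, and the function names are hypothetical, the codebook here is hand-written rather than learned, and tokens are joined by plain concatenation instead of the sign stitching method.

```python
from typing import Dict, List, Sequence

Pose = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(latent: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index (token) of the codebook entry nearest to `latent`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(latent, codebook[i]))


def tokens_to_poses(tokens: Sequence[int],
                    token_poses: Dict[int, List[Pose]]) -> List[Pose]:
    """Expand each token into its short pose sequence and concatenate.

    Assumption: naive concatenation stands in for the stitching step.
    """
    poses: List[Pose] = []
    for token in tokens:
        poses.extend(token_poses[token])
    return poses


# Hypothetical toy data: three codebook vectors in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.8]]
tokens = [quantise(z, codebook) for z in latents]

# Hypothetical mapping from each token to two frames of a 1-D "pose".
token_poses = {0: [[0.0], [0.1]], 1: [[0.5], [0.6]], 2: [[0.9], [1.0]]}
pose_sequence = tokens_to_poses(tokens, token_poses)
```

In the sketch, a translation model only has to predict the integer sequence `tokens`; the pose sequence then follows from a table lookup, which is what turns continuous pose generation into discrete sequence generation.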

Paper Structure (25 sections, 9 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: An overview of our approach to SLP, showing 1) the source spoken language sentence, 2) our intermediate representation of sign, 3) the synthesized sequence of signing, and 4) the original video.
  • Figure 2: An overview of the architecture used in our approach, showing a) the codebook training architecture and b) the text-to-codebook-tokens translation architecture.
  • Figure 3: A translation example produced by our best model on the RWTH-PHOENIX-Weather-2014T dataset.