A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

Eui Jun Hwang; Sukmin Cho; Huije Lee; Youngwoo Yoon; Jong C. Park

TL;DR

This paper tackles the inefficiencies and limitations of gloss-based intermediates in sign language translation and production by introducing UniGloR, a self-supervised framework that learns dense spatio-temporal representations from sign keypoints. Through SignMAE, a Transformer-based masked autoencoder, UniGloR derives implicit gloss-level representations and employs adaptive pose weighting to preserve subtle signing motions. It then uses task-specific mappings for Sign-to-Text and Text-to-Sign, including a non-autoregressive decoder and a length regulator, to perform SLT and SLP without explicit gloss annotations. Experiments on PHOENIX14T and How2Sign show competitive or superior performance to gloss-based methods and strong robustness to out-of-domain data, highlighting the practical potential of SSL-derived intermediate representations for sign language processing.
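The masked-autoencoder pretraining summarized above can be illustrated with a minimal sketch. It hides a random subset of frames in a keypoint segment and scores reconstruction only on the hidden frames, with per-joint weights standing in for adaptive pose weighting. Frame-level masking, the `mask_ratio` default, the `joint_weights` values, and both function names are assumptions made for illustration; they are not taken from the paper, whose masking scheme and weighting rule may differ.

```python
import random


def mask_frames(num_frames, mask_ratio=0.5, seed=None):
    """Split frame indices of one segment into (visible, masked) lists.

    mask_ratio is a hypothetical default, not a value from the paper.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_frames * mask_ratio))
    masked = set(rng.sample(range(num_frames), num_masked))
    visible = [t for t in range(num_frames) if t not in masked]
    return visible, sorted(masked)


def weighted_masked_mse(pred, target, masked, joint_weights):
    """Weighted squared error over masked frames only.

    pred and target are lists of frames; each frame is a list of (x, y)
    tuples, one per joint. joint_weights holds one weight per joint and is
    an assumed stand-in for adaptive pose weighting (for example, larger
    weights for hand joints).
    """
    total = 0.0
    norm = 0.0
    for t in masked:
        for j, w in enumerate(joint_weights):
            (px, py), (tx, ty) = pred[t][j], target[t][j]
            total += w * ((px - tx) ** 2 + (py - ty) ** 2)
            norm += w
    return total / norm if norm else 0.0
```

In a real model the encoder would see only the visible frames and the decoder would predict the masked ones; the loss above would then be computed between those predictions and the ground-truth keypoints.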

Abstract

This work addresses the challenges associated with the use of glosses in both Sign Language Translation (SLT) and Sign Language Production (SLP). While glosses have long been used as a bridge between sign language and spoken language, they come with two major limitations that impede the advancement of sign language systems. First, annotating glosses is a labor-intensive and time-consuming process, which limits the scalability of datasets. Second, glosses oversimplify sign language by stripping away its spatio-temporal dynamics, reducing complex signs to basic labels and missing the subtle movements essential for precise interpretation. To address these limitations, we introduce Universal Gloss-level Representation (UniGloR), a framework designed to capture the spatio-temporal features inherent in sign language, providing a more dynamic and detailed alternative to glosses. The core idea of UniGloR is simple yet effective: we derive dense spatio-temporal representations from sign keypoint sequences using self-supervised learning and integrate them into SLT and SLP tasks. Our experiments in a keypoint-based setting demonstrate that UniGloR either outperforms or matches the performance of previous SLT and SLP methods on two widely used datasets: PHOENIX14T and How2Sign.

Paper Structure (42 sections, 9 equations, 10 figures, 8 tables)

This paper contains 42 sections, 9 equations, 10 figures, 8 tables.

Introduction
Related Work
Sign Language Translation
Sign Language Production
Self-Supervised Learning
Method
Framework Overview
SignMAE
Adaptive Pose Weighting
Task Specific Mapping
Sign-to-Text Mapping.
Text-to-Sign Mapping.
Experimental Settings
Dataset and Evaluation Metrics
Datasets.
...and 27 more sections

Figures (10)

Figure 1: Comparison of the conventional and proposed approaches: (a) incorporating glosses as intermediates, and (b) UniGloR, which uses a learned latent space as the intermediate and requires no gloss annotations.
Figure 2: An overview of the proposed framework. (a) SignMAE: We pretrain SignMAE to derive implicit gloss-level representations $z$ from randomly sampled sign segments. (b) Sign-to-Text Mapping: We employ a sliding window approach to fully leverage and enrich the implicit representations, which we then map autoregressively to spoken language sentences. (c) Text-to-Sign Mapping: A non-autoregressive approach is utilized to map spoken language sentences to gloss-level representations. Following this, a length regulator adjusts the output length. Finally, the frozen SignMAE decoder generates the final sign keypoint sequences.
Figure 3: SLT performance as a function of stride.
Figure 4: A visual comparison of our method against baselines, including PT and NSLP-G, on PHOENIX14T. Frames were sampled uniformly. The dashed boxes highlight where our method produces more realistic and accurate sign keypoint sequences. GT refers to the ground-truth keypoints.
Figure 5: Visualization of gloss-level representations on the unseen KSL-Guide-Word dataset in a zero-shot setting. Ten of the most frequent glosses were selected for this analysis. The 2D plot was generated using t-SNE (van der Maaten and Hinton, 2008).
...and 5 more figures
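
The sliding window in Sign-to-Text Mapping and the length regulator in Text-to-Sign Mapping, both described in the Figure 2 caption, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function names, the handling of the final partial window, and the duration-based repetition (borrowed from the style of length regulators used in non-autoregressive speech synthesis) are all hypothetical choices. The caption only states that a sliding window is applied and that a length regulator adjusts the output length.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame ranges covering a keypoint sequence.

    Each range would be encoded into one gloss-level representation.
    A final window aligned to the end of the sequence is added when the
    regular stride would leave trailing frames uncovered (an assumption).
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return []
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def regulate_length(representations, durations):
    """Repeat each representation durations[i] times.

    A duration of 0 drops the representation. In a trained model the
    durations would come from a predictor; here they are given directly.
    """
    if len(representations) != len(durations):
        raise ValueError("one duration is required per representation")
    expanded = []
    for rep, duration in zip(representations, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        expanded.extend([rep] * duration)
    return expanded
```

A smaller stride yields more overlapping windows and therefore more representations per sentence, which is the quantity varied in Figure 3.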
