
Autoregressive Sign Language Production: A Gloss-Free Approach with Discrete Representations

Eui Jun Hwang, Huije Lee, Jong C. Park

TL;DR

This work tackles gloss-free Sign Language Production (SLP) by bypassing gloss annotations and encoding sign language as discrete tokens, which enables autoregressive decoding. SignVQNet discretizes sign pose sequences into tokens z using a Spatio-Temporal Graph Pyramid (STGP)-based dVAE with vector quantization; a Transformer-based autoregressive generator then predicts these tokens conditioned on the spoken-language input x. The approach achieves state-of-the-art results on PHOENIX14T and How2Sign, and the analysis indicates that Back-Translation and Fréchet Gesture Distance are more reliable evaluation metrics than DTW-MJE. The findings suggest that discrete latent representations can improve linguistic coherence and generation quality in gloss-free SLP.

Abstract

Gloss-free Sign Language Production (SLP) offers a direct translation of spoken language sentences into sign language, bypassing the need for gloss intermediaries. This paper presents the Sign language Vector Quantization Network, a novel approach to SLP that leverages Vector Quantization to derive discrete representations from sign pose sequences. Our method, rooted in both manual and non-manual elements of signing, supports advanced decoding methods and integrates latent-level alignment for enhanced linguistic coherence. Through comprehensive evaluations, we demonstrate superior performance of our method over prior SLP methods and highlight the reliability of Back-Translation and Fréchet Gesture Distance as evaluation metrics.
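
To make the idea of "discrete representations" concrete, the sketch below shows a generic nearest-neighbor vector quantization step: each continuous latent vector is replaced by the index of its closest codebook entry, and the indices can be mapped back to vectors for decoding. This is only an illustration under stated assumptions. The function names (`quantize`, `dequantize`), the toy codebook, and the use of a plain nearest-neighbor lookup are all hypothetical and are not taken from the paper, whose dVAE-based discretization may differ in detail.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in latents:
        distances = [squared_distance(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[t]) for t in tokens]


# Toy example (hypothetical values): three codebook entries, three latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(latents, codebook)          # discrete token sequence
reconstructed = dequantize(tokens, codebook)  # vectors passed on to a decoder
print(tokens)
```

Once pose sequences are expressed as token indices like these, generating a sign sequence from text becomes a next-token prediction problem, which is what makes standard autoregressive decoding strategies such as beam search applicable.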

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: An overview of SignVQNet, where both the pre-trained STGP encoder and decoder remain frozen. The encoder converts the sign pose sequence into discrete tokens. These tokens are then generated from textual inputs by a Transformer. The decoder transforms the generated tokens back into an actual sign pose sequence.
  • Figure 2: An overview of the STGP block. Spatial Convolution (SC) and Temporal Convolution (TC) process the input spatially and temporally, respectively, and the subsampled output preserves spatio-temporal features through a residual connection. BN refers to Batch Normalization.
  • Figure 3: Performance discrepancy among the SLP metrics in relation to change in beam size. Except for DTW-MJE, all metrics show consistent improvement as beam size increases.
  • Figure 4: We present a visual comparison between our method and the baselines on both (a) PHOENIX-2014-T and (b) How2Sign. As highlighted in the dashed boxes, our method generates more realistic and accurate sign pose sequences. Videos are available at http://nlpcl.kaist.ac.kr/projects/signvqnet
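
Since the Figure 3 caption singles out DTW-MJE, a rough sketch of how such a metric is commonly computed may help. The code below combines dynamic time warping (DTW) with a mean joint error (MJE) frame cost: poses are compared joint by joint, and DTW finds the cheapest monotonic alignment between two sequences of possibly different lengths. The function names, the toy 2D poses, and the choice to return the unnormalized accumulated cost are assumptions made for illustration; the exact normalization used in the paper is not stated in this summary.

```python
import math
from typing import List, Sequence, Tuple

Joint = Tuple[float, float]
Pose = Sequence[Joint]


def mean_joint_error(pose_a: Pose, pose_b: Pose) -> float:
    """Average Euclidean distance between corresponding joints of two poses."""
    total = sum(math.dist(ja, jb) for ja, jb in zip(pose_a, pose_b))
    return total / len(pose_a)


def dtw_mje(seq_a: Sequence[Pose], seq_b: Sequence[Pose]) -> float:
    """Accumulated MJE along the optimal DTW alignment (unnormalized)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = mean_joint_error(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b frame repeated
                                 cost[i][j - 1],      # seq_a frame repeated
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


# Toy two-joint, two-frame reference sequence (hypothetical values).
reference = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
# Same motion with the first frame held twice: DTW absorbs the timing change.
slower = [reference[0], reference[0], reference[1]]
# Same motion with every joint displaced by (3, 4), i.e. 5 units per joint.
shifted = [[(x + 3.0, y + 4.0) for x, y in pose] for pose in reference]

print(dtw_mje(reference, slower), dtw_mje(reference, shifted))
```

The example shows why the metric is purely geometric: it tolerates timing differences but penalizes any positional offset, regardless of whether the signed content is still intelligible. That property is one plausible reason a pose-distance score could diverge from Back-Translation as beam size changes.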