Autoregressive Sign Language Production: A Gloss-Free Approach with Discrete Representations
Eui Jun Hwang, Huije Lee, Jong C. Park
TL;DR
This work tackles gloss-free Sign Language Production by encoding sign language as discrete tokens, which enables autoregressive decoding without gloss annotations. SignVQNet discretizes sign pose sequences with a Spatio-Temporal Graph Pyramid-based dVAE and vector quantization, producing tokens z that drive a Transformer-based autoregressive generator conditioned on spoken input x. The approach achieves state-of-the-art results on PHOENIX14T and How2Sign. The evaluation also indicates that Back-Translation and Fréchet Gesture Distance are more reliable metrics than DTW-MJE. The findings suggest discrete latent representations can improve linguistic coherence and generation quality in gloss-free SLP.
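The autoregressive generation step described above can be written as the standard left-to-right factorization over discrete tokens. This is a generic sketch rather than the paper's exact objective: z and x are the symbols used above, while the sequence length T and the index t are notation assumed here for illustration.

```latex
% Assumed notation: z = (z_1, ..., z_T) are the discrete sign tokens,
% x is the spoken-language input, and z_{<t} denotes tokens before step t.
p(z \mid x) = \prod_{t=1}^{T} p\left(z_t \mid z_{<t},\, x\right)
```

Because each factor is a distribution over a finite token vocabulary, decoding can pick or sample one token at a time, which is not directly available when regressing continuous poses.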
Abstract
Gloss-free Sign Language Production (SLP) offers a direct translation of spoken language sentences into sign language, bypassing the need for gloss intermediaries. This paper presents the Sign language Vector Quantization Network (SignVQNet), a novel approach to SLP that leverages Vector Quantization to derive discrete representations from sign pose sequences. Our method, rooted in both manual and non-manual elements of signing, supports advanced decoding methods and integrates latent-level alignment for enhanced linguistic coherence. Through comprehensive evaluations, we demonstrate that our method outperforms prior SLP methods and highlight the reliability of Back-Translation and Fréchet Gesture Distance as evaluation metrics.
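To make the idea of deriving discrete representations concrete, here is a minimal sketch of generic vector quantization: each continuous latent vector is replaced by the index of its nearest codebook entry. It is an illustration under stated assumptions, not the paper's encoder. The function name `quantize`, the toy codebook, and the 2-dimensional latents are all hypothetical, and the paper's dVAE may assign tokens differently.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    tokens = []
    for v in latents:
        # Argmin over codebook indices by distance to the latent vector.
        idx = min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        tokens.append(idx)
    return tokens


# Toy data (hypothetical): three codebook entries and three latent vectors,
# standing in for per-frame pose features produced by an encoder.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
tokens = quantize(latents, codebook)
print(tokens)
```

The resulting integer sequence is the kind of discrete token stream that an autoregressive model can predict one step at a time.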
