A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production
Sümeyye Meryem Taşyürek, Enis Mücahid İskender, Hacer Yalim Keles
TL;DR
A$^{2}$V-SLP addresses gloss-free Sign Language Production with an alignment-aware variational framework that learns articulator-wise latent distributions rather than deterministic embeddings. A structurally disentangled VAE provides per-articulator means and variances, which supervise a non-autoregressive Transformer that predicts distributional latent targets from text. A gloss attention mechanism enforces local temporal alignment without requiring gloss labels. Training proceeds in two phases: the first regresses the latent statistics, and the second aligns the predicted distributions via KL regularization, with adaptive region-focused reconstruction weighting to preserve fine hand articulation. Results on PHOENIX-2014T and CSL-Daily show consistent improvements over deterministic latent regression and achieve state-of-the-art back-translation performance, demonstrating the effectiveness of combining variational latent modeling, alignment-aware attention, and adaptive training for gloss-free SLP.
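To make the KL regularization step concrete, the following is a minimal sketch of a KL divergence between two diagonal Gaussians, summed per articulator. It assumes each distribution is given as a (mean, log-variance) pair; the direction of the divergence (predicted relative to target), the function names, and the articulator key are illustrative assumptions, not details taken from the paper.

```python
import math


def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), diagonal case.

    Assumption: q is the predicted distribution and p is the target one;
    the paper may use a different direction or weighting.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def articulator_kl(pred, target):
    """Per-articulator KL; both arguments map a name to a (mu, logvar) pair."""
    return {
        name: diag_gaussian_kl(*pred[name], *target[name])
        for name in target
    }


# Hypothetical articulator name and toy 2-dimensional latents.
pred = {"right_hand": ([1.0, 0.0], [0.0, 0.0])}
target = {"right_hand": ([0.0, 0.0], [0.0, 0.0])}
kl_terms = articulator_kl(pred, target)
```

Keeping one KL term per articulator, rather than a single pooled value, is one way such a loss could be weighted differently for hands and body, which is in the spirit of the region-focused weighting described above.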
Abstract
Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which serve as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, and the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. By modeling latents as distributions, this formulation avoids deterministic latent collapse and preserves articulator-level representations. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, with state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
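As an illustration of the stochastic sampling step, the sketch below draws a latent vector from predicted means and log-variances with the standard reparameterization z = mu + sigma * eps, where sigma = exp(0.5 * logvar). The function names, the articulator keys, and the dictionary layout are hypothetical; the abstract does not specify how the sampling is implemented.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Draw z = mu + exp(0.5 * logvar) * eps with eps ~ N(0, 1), per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


def sample_articulators(stats, rng):
    """Sample one latent per articulator; stats maps name -> (mu, logvar)."""
    return {
        name: sample_latent(mu, logvar, rng)
        for name, (mu, logvar) in stats.items()
    }


# Hypothetical articulator names and toy predicted statistics.
predicted = {
    "body": ([0.2, -0.1], [-1.0, -1.0]),
    "right_hand": ([0.5, 0.3], [-2.0, -2.0]),
}
latents = sample_articulators(predicted, random.Random(0))
```

Predicting log-variances rather than variances keeps the standard deviation positive by construction, which is presumably why the Transformer outputs them in that form. In the described pipeline, the sampled latents would then be passed to the VAE decoder to produce pose sequences.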
