A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production

Sümeyye Meryem Taşyürek, Enis Mücahid İskender, Hacer Yalim Keles

TL;DR

A$^{2}$V-SLP tackles gloss-free Sign Language Production with an alignment-aware variational framework that learns articulator-wise latent distributions rather than deterministic embeddings. A structurally disentangled VAE provides per-articulator means and variances; these supervise a non-autoregressive Transformer that predicts the latent distributions from text, with gloss attention enforcing local temporal alignment without gloss labels. Training proceeds in two phases: the model first regresses the latent statistics, then aligns the predicted distributions via KL regularization, with adaptive region-focused reconstruction weighting to preserve fine hand articulation. On PHOENIX-2014T and CSL-Daily, the method improves consistently over deterministic latent regression and reaches state-of-the-art back-translation performance, which supports combining variational latent modeling, alignment-aware attention, and adaptive training for gloss-free SLP.
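
The two training phases can be illustrated with a minimal sketch, under several assumptions that are not stated in the summary: diagonal Gaussian latents, a mean-squared-error form for the phase-one regression, the KL direction KL(target || predicted), and the articulator names in `ARTICULATORS`, which are hypothetical. Each latent is reduced to a single vector here, whereas the actual model works on sequences.

```python
import math

# Hypothetical articulator names; the source only says "articulator-wise".
ARTICULATORS = ("body", "left_hand", "right_hand", "face")


def stat_regression_loss(pred, target):
    """Phase 1 (assumed form): mean squared error on means and log-variances."""
    total, count = 0.0, 0
    for name in ARTICULATORS:
        for key in ("mu", "logvar"):
            for p, t in zip(pred[name][key], target[name][key]):
                total += (p - t) ** 2
                count += 1
    return total / count


def kl_diag_gaussians(mu_a, logvar_a, mu_b, logvar_b):
    """Closed-form KL( N(mu_a, var_a) || N(mu_b, var_b) ) for diagonal Gaussians."""
    total = 0.0
    for ma, la, mb, lb in zip(mu_a, logvar_a, mu_b, logvar_b):
        total += 0.5 * (lb - la + (math.exp(la) + (ma - mb) ** 2) / math.exp(lb) - 1.0)
    return total


def kl_alignment_loss(pred, target):
    """Phase 2 (assumed direction): KL(target || predicted), summed over articulators."""
    return sum(
        kl_diag_gaussians(
            target[name]["mu"], target[name]["logvar"],
            pred[name]["mu"], pred[name]["logvar"],
        )
        for name in ARTICULATORS
    )
```

Both losses are zero when the predicted statistics match the VAE encoder's statistics exactly. The KL term differs from plain regression in that it scales errors in the mean by the predicted variance.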

Abstract

Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which are used as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, while the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. By modeling the latents as distributions, this formulation avoids deterministic latent collapse and preserves articulator-level representations. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
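
The stochastic sampling step described above can be sketched with the standard reparameterization $z = \mu + \exp(\tfrac{1}{2}\log\sigma^{2})\,\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. This is a minimal illustration rather than the authors' implementation: the helper names, the dictionary layout, and the articulator key used in the usage example are assumptions, and each latent is shown as a single vector instead of a sequence.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps, with sigma = exp(0.5 * logvar) and eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def sample_all(predicted, rng):
    """Sample one latent vector per articulator from its predicted mean and log-variance."""
    return {
        name: sample_latent(stats["mu"], stats["logvar"], rng)
        for name, stats in predicted.items()
    }


# Usage: "right_hand" is a hypothetical articulator name.
predicted = {"right_hand": {"mu": [0.5, -1.0], "logvar": [-1.0, -1.0]}}
latents = sample_all(predicted, random.Random(0))
```

The sampled latents would then be passed to the VAE decoder to produce poses. Predicting log-variances rather than variances keeps the standard deviation positive without any constraint on the network output.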

Paper Structure (20 sections, 9 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Overview of the proposed A$^{2}$V-SLP framework. BERT-based text embeddings are mapped by a non-autoregressive Transformer to structurally disentangled latent mean and variance parameters. During training, the pretrained VAE encoder provides articulator-wise latent distributions from ground-truth poses as supervision targets. Gloss attention replaces decoder self-attention with local temporal modeling, while cross-attention to text remains global. At inference time, latent samples are drawn from the predicted distributions and decoded into sign pose sequences using the VAE decoder.
  • Figure 2: Pose sequence generated from the input sentence "am mittwoch zieht von der nordsee regen heran der sich am donnerstag ausbreitet" (English: "On Wednesday, rain moves in from the North Sea and spreads on Thursday.")
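
The Figure 1 caption states that gloss attention replaces decoder self-attention with local temporal modeling, while cross-attention to the text stays global. As a rough illustration only, the sketch below restricts each frame to a fixed symmetric window of neighbouring frames. The fixed window, the function names, and the `window` parameter are assumptions; the actual gloss attention mechanism may select its neighbours differently.

```python
import math


def local_window_mask(num_frames, window):
    """mask[i][j] is True when frame i may attend to frame j (|i - j| <= window // 2)."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(num_frames)] for i in range(num_frames)]


def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    peak = max(s for s, allowed in zip(scores, mask_row) if allowed)
    exps = [math.exp(s - peak) if allowed else 0.0 for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: with 5 frames and a window of 3, frame 2 attends to frames 1, 2 and 3.
mask = local_window_mask(5, 3)
weights = masked_softmax([0.0, 0.0, 0.0, 0.0, 0.0], mask[2])
```

Cross-attention to the text would use no such mask, so every frame can still draw on the whole sentence.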