Diversity-Aware Sign Language Production through a Pose Encoding Variational Autoencoder
Mohamed Ilyes Lakhal, Richard Bowden
TL;DR
The paper tackles diversity-aware sign language production by introducing a Pose Encoding Variational Autoencoder (PE-VAE) and a UNet-based generator (PENet) that synthesizes signer images conditioned on a 2D pose and demographic attributes. PE-VAE learns pose-agnostic appearance features and integrates attribute information through multi-head attention fusion, while PENet uses per-body-part decoders and an edge-aware loss to preserve spatial fidelity and high-frequency details. Evaluations on the SMILE II dataset show improved image quality and diversity, with better pose estimation and non-manual feature fidelity than state-of-the-art baselines. The approach enables anonymized, controllable sign language synthesis, with potential uses in data augmentation and inclusive representation in sign-language technologies.
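To make the idea of an edge-aware loss concrete, here is a minimal NumPy sketch. It is not the loss used in the paper, whose exact form is not given in this summary. It assumes a plain L1 pixel term plus an L1 term on Sobel gradient magnitudes. The names `sobel_edges`, `edge_aware_l1`, and `edge_weight` are hypothetical.

```python
import numpy as np

def sobel_edges(img):
    """Gradient magnitude of a 2D grayscale image using 3x3 Sobel kernels.

    Assumption: edges are measured with Sobel filters; the paper may use
    a different edge operator.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = img.shape
    # Replicate border pixels so the output keeps the input shape.
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Cross-correlate with both kernels by summing shifted windows.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.sqrt(gx ** 2 + gy ** 2)

def edge_aware_l1(pred, target, edge_weight=1.0):
    """L1 pixel loss plus a weighted L1 loss between edge maps (hypothetical form)."""
    pixel_term = np.mean(np.abs(pred - target))
    edge_term = np.mean(np.abs(sobel_edges(pred) - sobel_edges(target)))
    return pixel_term + edge_weight * edge_term

# Usage: a flat image compared with one containing a vertical step edge.
flat = np.zeros((8, 8))
step = flat.copy()
step[:, 4:] = 1.0
loss = edge_aware_l1(flat, step)
```

The edge term penalizes differences in high-frequency structure that a pure per-pixel loss tends to average away, which is the usual motivation for adding one.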
Abstract
This paper addresses the problem of diversity-aware sign language production: given an image (or sequence) of a signer, we want to produce another image with the same pose but different attributes (e.g., gender, skin color). To this end, we extend the variational inference paradigm to include information about the pose and to condition on the attributes. This formulation improves the quality of the synthesised images. The generator is a UNet architecture, which ensures spatial preservation of the input pose, and it incorporates the visual features from the variational inference to maintain control over appearance and style. We generate each body part with a separate decoder, which allows the generator to deliver better overall results. Experiments on the SMILE II dataset show that the proposed model performs quantitatively better than state-of-the-art baselines regarding diversity, per-pixel image quality, and pose estimation. Quantitatively, it faithfully reproduces non-manual features for signers.
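As a point of reference for the variational formulation mentioned above, a generic conditional variational objective can be written as follows. This is the standard conditional evidence lower bound, not necessarily the exact objective of the paper. The symbols are assumptions introduced here: $x$ is the signer image, $y$ the pose, $a$ the attributes, and $z$ the latent appearance code.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid x, y, a)}
      \left[ \log p_\theta(x \mid z, y, a) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x, y, a) \,\|\, p_\theta(z \mid y, a) \right)
```

The first term rewards faithful reconstruction of the image given the latent code and the conditioning signals. The second keeps the approximate posterior close to the prior, which in many implementations is simplified to a standard normal $\mathcal{N}(0, I)$.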
