Diversity-Aware Sign Language Production through a Pose Encoding Variational Autoencoder

Mohamed Ilyes Lakhal, Richard Bowden

TL;DR

The paper tackles diversity-aware sign language production by introducing a Pose Encoding Variational Autoencoder (PE-VAE) and a UNet-based generator (PENet) that synthesizes signer images conditioned on a 2D pose and demographic attributes. PE-VAE learns pose-agnostic appearance features and integrates attribute information via multi-head attention fusion, while PENet uses per-body-part decoders and an edge-aware loss to preserve spatial fidelity and high-frequency details. Evaluations on the SMILE II dataset show improved image quality and diversity, with better pose estimation and non-manual feature fidelity compared to state-of-the-art baselines. The approach enables anonymized, controllable sign language synthesis with potential for data augmentation and inclusive representation in sign-language technologies.

Abstract

This paper addresses the problem of diversity-aware sign language production: given an image (or sequence) of a signer, we want to produce another image with the same pose but different attributes (e.g., gender, skin color). To this end, we extend the variational inference paradigm to include information about the pose and the conditioning of the attributes. This formulation improves the quality of the synthesised images. The generator is a UNet architecture, which ensures spatial preservation of the input pose, and it includes the visual features from the variational inference to maintain control over appearance and style. We generate each body part with a separate decoder, which allows the generator to deliver better overall results. Experiments on the SMILE II dataset show that the proposed model performs quantitatively better than state-of-the-art baselines regarding diversity, per-pixel image quality, and pose estimation. Qualitatively, it faithfully reproduces non-manual features for signers.

Paper Structure (13 sections, 12 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Diversity-Aware Sign Language Production. Given an input image of a signer, we would like to synthesize a novel (unseen) image of another signer given an attribute (e.g., ethnicity) and a corresponding pose. KEY -- top-row: given a Japanese signer, we produce a Swiss signer with the same pose (Japanese $\to$ Swiss); bottom-row: Swiss $\to$ Japanese.
  • Figure 2: PENet. The network is presented as a conditional VAE-GAN, where the variational part learns the distribution of visual features from signers with different attributes (skin tone, ethnicity, gender) through variational inference. The attribute $a$ is represented as a feature vector extracted from a pre-trained CLIP model. The latent code $z$ and $a$ are combined through a multi-head attention (MHA) module. The pose $y$ is processed through a UNet encoder-decoder network to retain the spatial information of the keypoints; the visual feature $z_a$ guides the synthesis of the person through a mapping $\Psi$ (Eq. \ref{psi_z}).
  • Figure 3: Pose Aggregation Module. Using a text prompt of an attribute (e.g., gender), we extract a feature $a \in \mathbb{R}^{512}$ from a pre-trained CLIP model (Radford et al., 2021). This feature is then concatenated with the latent code $z$ and fed into a multi-layer attention module.
  • Figure 4: Edge loss. Effect of the edge loss on the synthesised frames. We show the heatmaps of the same frame with and without the edge loss $L_{\textbf{edge}}$. Notice that the errors when using $L_{\textbf{edge}}$ come mainly from the hand of the clothes, i.e., not from the boundaries of the body.
  • Figure 5: Example showing the effect of skip connections and injecting the appearance feature $\mathbf{z}_a$ using $\Psi$ in the decoder.
  • ...and 7 more figures
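
The attribute fusion described in the captions of Figures 2 and 3 (a CLIP attribute feature $a$ combined with the latent code $z$ through an attention module) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes that "concatenation" means stacking $z$ and $a$ as two tokens, uses random matrices in place of learned projections, omits the output projection, and shrinks the feature size from 512 to 8 for readability. All function names are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rng, rows, cols):
    """Stand-in for a learned projection (assumption: random weights)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(tokens, Wq, Wk, Wv, num_heads):
    """Scaled dot-product self-attention with the feature axis split into heads."""
    dim = len(tokens[0])
    assert dim % num_heads == 0
    head_dim = dim // num_heads
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    outputs = [[] for _ in tokens]
    all_weights = []
    for h in range(num_heads):
        start, stop = h * head_dim, (h + 1) * head_dim
        head_weights = []
        for i, q in enumerate(Q):
            # Similarity of token i to every token, within this head's slice.
            scores = [
                sum(x * y for x, y in zip(q[start:stop], k[start:stop])) / math.sqrt(head_dim)
                for k in K
            ]
            w = softmax(scores)
            head_weights.append(w)
            # Weighted mix of the value slices; heads are concatenated in order.
            mixed = [
                sum(w[j] * V[j][start + d] for j in range(len(tokens)))
                for d in range(head_dim)
            ]
            outputs[i].extend(mixed)
        all_weights.append(head_weights)
    return outputs, all_weights

def fuse(z, a, num_heads=2, seed=0):
    """Fuse latent code z with attribute feature a; returns the attended z token."""
    assert len(z) == len(a)
    rng = random.Random(seed)
    dim = len(z)
    Wq, Wk, Wv = (random_matrix(rng, dim, dim) for _ in range(3))
    tokens = [z, a]  # assumption: z and a are stacked as a two-token sequence
    out, weights = multi_head_self_attention(tokens, Wq, Wk, Wv, num_heads)
    return out[0], weights

rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(8)]  # toy latent code
a = [rng.gauss(0, 1) for _ in range(8)]  # toy stand-in for the 512-d CLIP feature
z_a, attn = fuse(z, a)
```

Here `z_a` plays the role of the attribute-aware visual feature $z_a$ from Figure 2, and `attn` holds, for each head, how strongly each token attends to $z$ and to $a$. In a trained model the projections would be learned, so the attribute token could steer the appearance code rather than mix with it arbitrarily as in this toy version.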