DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism

Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain

TL;DR

DiffSign tackles scalable sign language video generation for the Deaf and Hard of Hearing by combining parametric pose modeling with diffusion-based signer synthesis. It retargets 2D sign poses to a 3D SMPL-X avatar, renders high-fidelity poses, and transfers them to a synthetic signer via a visual adapter-enabled diffusion model conditioned on poses, with optional multimodal prompts for appearance. The approach yields improved temporal consistency and realism over text-prompt baselines, supports zero-shot signer customization and signer anonymity through single-image conditioning, and enables diverse, region-specific signers. This work enhances accessibility for global media content and provides a practical, reproducible framework for customizable sign language video generation.

Abstract

The proliferation of streaming services in recent years has made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to local audiences, support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts that allow users to further customize the appearance of the signer to accommodate diversity (e.g., skin tone, gender). Our approach is also useful for signer anonymization.

Paper Structure

This paper contains 17 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: High-level overview of our approach
  • Figure 2: Frame-by-frame generation using only a text prompt to control the signer appearance results in some inconsistency, especially for longer sign language videos. For example, the above frames were generated using the same seed and the control prompt "a young male with beard wearing a white shirt". Best viewed in color.
  • Figure 3: Approach combining parametric and generative modeling for customizable sign language video generation with human-like synthetic signers. Best viewed in color.
  • Figure 4: Improving consistency of signer appearance in the video by conditioning on an image using a visual adapter. Best viewed in color.
  • Figure 5: Personalizing the signer appearance by fine-tuning the diffusion model on a few images of the target signer using DreamBooth.
  • ...and 3 more figures