DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism
Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain
TL;DR
DiffSign tackles scalable sign language video generation for the Deaf and Hard of Hearing community by combining parametric pose modeling with diffusion-based signer synthesis. It retargets 2D sign poses to a 3D SMPL-X avatar, renders high-fidelity poses from that avatar, and uses them to condition the poses of a synthetic signer generated by a diffusion model. The signer's appearance is set by an image prompt supplied through a visual adapter, with optional multimodal prompts for further customization. The approach yields better temporal consistency and realism than a diffusion baseline conditioned only on text prompts, supports zero-shot signer customization and signer anonymity through single-image conditioning, and enables diverse, region-specific signers. This work enhances accessibility for global media content and provides a practical, reproducible framework for customizable sign language video generation.
Abstract
The proliferation of streaming services in recent years has made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to local audiences, support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g., skin tone, gender). Our approach is also useful for signer anonymization.
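The retargeting step, fitting the parameters of a body model so that its keypoints match observed sign poses, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a planar two-link arm stands in for a full parametric body model, the names `forward_keypoints`, `keypoint_loss`, and `retarget` are hypothetical, and plain gradient descent with finite-difference gradients is a stand-in optimizer, not the procedure used in this work.

```python
import numpy as np

# Hypothetical toy "parametric model": a planar two-link arm with the
# shoulder at the origin. It only illustrates the fitting idea; a real
# body model has many more parameters (shape, hands, face).
LINK_LENGTHS = np.array([1.0, 0.8])


def forward_keypoints(angles):
    """Map two joint angles (radians) to 2D elbow and wrist positions."""
    a1, a2 = angles
    elbow = LINK_LENGTHS[0] * np.array([np.cos(a1), np.sin(a1)])
    wrist = elbow + LINK_LENGTHS[1] * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
    return np.stack([elbow, wrist])


def keypoint_loss(angles, target):
    """Sum of squared distances between model keypoints and target keypoints."""
    return float(np.sum((forward_keypoints(angles) - target) ** 2))


def numerical_grad(angles, target, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. the angles."""
    grad = np.zeros_like(angles)
    for i in range(angles.size):
        step = np.zeros_like(angles)
        step[i] = eps
        grad[i] = (
            keypoint_loss(angles + step, target)
            - keypoint_loss(angles - step, target)
        ) / (2 * eps)
    return grad


def retarget(target, lr=0.1, steps=500):
    """Fit joint angles to observed 2D keypoints by gradient descent."""
    angles = np.zeros(2)  # start from a straight arm
    for _ in range(steps):
        angles -= lr * numerical_grad(angles, target)
    return angles


# Synthetic "observed" keypoints generated from known angles.
observed = forward_keypoints(np.array([0.6, 0.9]))
fitted = retarget(observed)
final_loss = keypoint_loss(fitted, observed)
print("fitted angles:", fitted, "loss:", final_loss)
```

In a real pipeline the same loop would run per frame over many more parameters, typically with automatic differentiation and regularizers that keep poses plausible and smooth over time.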
