SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

Zhengdi Yu; Shaoli Huang; Yongkang Cheng; Tolga Birdal

TL;DR

SignAvatars tackles the lack of large-scale 3D sign language data by introducing a multi-prompt 3D holistic SL dataset accompanied by an automated annotation pipeline that yields expressive body, hand, and face meshes. It further presents SignVAE, a two-stage VQ-VAE based 3D sign language production system with parallel linguistic feature generation and an autoregressive code-index generator, plus a unified benchmark for continuous, co-articulated 3D SL motion from diverse prompts. The dataset spans seven modality-based subsets and enables tasks such as 3D sign language recognition and 3D sign language production from HamNoSys, spoken language, and words. The work demonstrates improved 3D reconstruction quality and promising 3D SLP performance, offering a practical pathway to deploy 3D SL technology in education, translation, AR/VR, and broader Deaf community applications.
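As a rough illustration of the "code index" idea behind a VQ-VAE, the sketch below shows only the generic nearest-neighbour quantization step: each continuous feature vector is replaced by the index of its closest codebook entry. It is a minimal toy under stated assumptions, not the SignVAE implementation; the function names, the 2-D toy codebook, and the feature values are all hypothetical, and the learned encoder, decoder, and autoregressive generator are omitted.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for f in features:
        # Squared Euclidean distance from this feature to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each code index."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy data: a 3-entry codebook of 2-D vectors and three features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
motion_features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

codes = quantize(motion_features, codebook)
reconstructed = dequantize(codes, codebook)
```

In a setup of this kind, a second-stage model would predict sequences of such indices rather than raw continuous motion, which is what makes an autoregressive generator over discrete tokens applicable.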

Abstract

We present SignAvatars, the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals. While there is an exponentially growing body of research on digital communication, the majority of existing communication technologies cater primarily to spoken or written languages rather than SL, the essential communication method for Deaf and hard-of-hearing communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D, as annotating 3D models and avatars for SL is usually an entirely manual and labor-intensive process conducted by SL experts, often resulting in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically valid poses of body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. Hence, to evaluate the potential of SignAvatars, we further propose a unified benchmark of 3D SL holistic motion production. We believe that this work is a significant step forward towards bringing the digital world to the Deaf and hard-of-hearing communities as well as people interacting with them.

Paper Structure (50 sections, 10 equations, 14 figures, 6 tables)

Introduction
Related Work
3D holistic mesh reconstruction
SL datasets
SL applications
SignAvatars Dataset
Overview
Dataset Characteristics
Expressive motion representation
Sign language notation
Data sources
Automatic Holistic Annotation
Holistic regularization
Biomechanical hand constraints
Hierarchical initialization
...and 35 more sections

Figures (14)

Figure 1: Overview of SignAvatars, the first publicly available, large-scale multi-prompt 3D sign language holistic motion dataset. (upper row) We introduce a generic method to automatically annotate a large corpus of video data. (lower row) We propose a 3D SLP benchmark to produce plausible 3D holistic mesh motion and provide a neural architecture as well as baselines tailored for this novel task.
Figure 2: Overview of our automatic annotation pipeline. Given an RGB image sequence as input, we perform a hierarchical initialization, followed by an optimization involving temporal smoothness and biomechanical constraints. Finally, our pipeline outputs the final 3D motion results as a sequence of SMPL-X parameters.
Figure 3: Our 3D SLP network, SignVAE, consists of two stages. We first create semantic and motion codebooks using two VQ-VAEs, mapping inputs to their respective code indices. Then, using an auto-regressive model, we generate motion code indices based on semantic code indices, ensuring a coherent understanding of the data.
Figure 4: Comparison of 3D holistic body reconstruction. The results from PIXIE [PIXIE:2021], PyMAF-X [pymafx2023], and Ours on our dataset, SignAvatars.
Figure 5: Comparing 3D holistic human mesh reconstruction methods on the EHF dataset [SMPL-X:2019]. Our annotation method produces significantly better holistic reconstructions with plausible poses and the best pixel alignment. (Zoom in for a better view)
...and 9 more figures
