Unsupervised Sign Language Translation and Generation

Zhengsheng Guo, Zhiwei He, Wenxiang Jiao, Xing Wang, Rui Wang, Kehai Chen, Zhaopeng Tu, Yong Xu, Min Zhang

TL;DR

This work addresses the scarcity of parallel sign-language data by introducing USLNet, an unsupervised network that jointly learns sign language translation and generation from single-modality text and video data. The model combines a Text Reconstruction Module, a Video Reconstruction Module based on VideoGPT with VQ-VAE quantization, and two cross-modality back-translation paths (T2V2T-BT and V2T2V-BT). A Sliding Window Aligner guides the back-translation paths by bridging the length and feature-dimension gaps between text and video. Training optimizes the composite objective $\mathcal{L}_{overall} = \alpha_1 \mathcal{L}_{text} + \alpha_2 \mathcal{L}_{codebook} + \alpha_3 \mathcal{L}_{video} + \alpha_4 \mathcal{L}_{T2V2T} + \alpha_5 \mathcal{L}_{V2T2V}$. The approach is validated on BOBSL and OpenASL, where it achieves results competitive with supervised baselines and shows that unsupervised pretraining plus fine-tuning can yield state-of-the-art performance on challenging sign-language datasets. By unifying translation and generation within a single framework and leveraging cross-modal back-translation, USLNet advances practical SLTG capabilities without requiring parallel video-text corpora. This has potential impact for accessibility technologies and inclusive communication where annotated sign-language resources are scarce. Future work may focus on scaling to larger sign languages and further optimizing the cross-modal alignment components.
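As a rough illustration of the composite objective, the sketch below combines five loss values as a weighted sum. The function name `overall_loss`, the dictionary keys, and all numeric values are hypothetical and chosen only for illustration; the summary does not state the actual weights or loss magnitudes.

```python
from typing import Mapping

# Names of the five terms in the composite objective, in the order given above.
LOSS_TERMS = ("text", "codebook", "video", "T2V2T", "V2T2V")


def overall_loss(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the weighted sum of the five loss terms (alpha_i * L_i)."""
    return sum(weights[name] * losses[name] for name in LOSS_TERMS)


# Hypothetical loss values and weights, for illustration only.
example_losses = {"text": 2.0, "codebook": 0.5, "video": 1.5, "T2V2T": 3.0, "V2T2V": 2.5}
example_weights = {name: 1.0 for name in LOSS_TERMS}

total = overall_loss(example_losses, example_weights)
print(total)
```

In a real training loop each entry of `losses` would be a differentiable tensor rather than a float, but the combination rule stays the same.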

Abstract

Motivated by the success of unsupervised neural machine translation (UNMT), we introduce an unsupervised sign language translation and generation network (USLNet), which learns from abundant single-modality (text and video) data without parallel sign language data. USLNet comprises two main components: single-modality reconstruction modules (text and video) that rebuild the input from its noisy version in the same modality, and cross-modality back-translation modules (text-video-text and video-text-video) that reconstruct the input from its noisy version in the other modality using a back-translation procedure. Unlike the single-modality back-translation procedure in text-based UNMT, USLNet faces a cross-modality discrepancy in feature representation: the length and the feature dimension of text and video sequences do not match. We propose a sliding window method to align variable-length text with video sequences. To our knowledge, USLNet is the first unsupervised sign language translation and generation model capable of generating both natural language text and sign language video in a unified manner. Experimental results on the BBC-Oxford Sign Language dataset (BOBSL) and the Open-Domain American Sign Language dataset (OpenASL) show that USLNet achieves competitive results compared to supervised baseline models, indicating its effectiveness in sign language translation and generation.
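To make the length-alignment idea concrete, here is a minimal sketch of one way a sliding window could map a text embedding sequence onto a longer (or shorter) target length. Everything in it is an assumption made for illustration: the function name `sliding_window_align`, the `window` and `sigma` parameters, the way window centers are spaced, and the Gaussian weighting over positions. It is not the authors' implementation, and it handles only the length mismatch; the feature-dimension mismatch (for example, via a learned projection) is left out.

```python
import math
from typing import List

Vector = List[float]


def sliding_window_align(text_emb: List[Vector], target_len: int,
                         window: int = 3, sigma: float = 1.0) -> List[Vector]:
    """Resample a text embedding sequence to target_len positions.

    Each output vector is a Gaussian-weighted average of the text vectors
    inside a window centered on the corresponding source position.
    """
    if not text_emb or target_len <= 0:
        raise ValueError("need a non-empty input and a positive target length")
    n, dim = len(text_emb), len(text_emb[0])
    aligned = []
    for j in range(target_len):
        # Fractional source position that output step j corresponds to.
        center = (j + 0.5) * n / target_len - 0.5
        nearest = min(n - 1, max(0, round(center)))
        lo = max(0, nearest - window // 2)
        hi = min(n, nearest + window // 2 + 1)
        # Unnormalized Gaussian weights over the positions in the window.
        raw = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(lo, hi)]
        norm = sum(raw)
        weights = [w / norm for w in raw]
        aligned.append([
            sum(w * text_emb[i][d] for w, i in zip(weights, range(lo, hi)))
            for d in range(dim)
        ])
    return aligned


# Two hypothetical word embeddings stretched to four pseudo-video steps.
pseudo_video = sliding_window_align([[1.0, 0.0], [0.0, 1.0]], target_len=4)
```

Because the weights in each window are normalized to sum to one, every output vector is a convex combination of nearby word embeddings, so early output steps stay close to the first words and late steps to the last ones.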

Paper Structure

This paper contains 44 sections, 12 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The sign video reconstruction module. This module reconstructs the original video from the downsampled discrete latent representations of the raw video data. In the quantization stage, the module transforms the video embeddings into discrete video tokens using a codebook. These video tokens are then input into GPT to generate the next visual token.
  • Figure 2: The overall framework of the proposed USLNet. The gray line denotes the text reconstruction procedure. The blue line denotes the video reconstruction procedure. The yellow line denotes the sign language translation procedure, which translates video into the corresponding text. The red line denotes the sign language generation procedure, which translates text into the corresponding video.
  • Figure 3: Left: The sliding window aligner at step one. Right: Visualization of the Gaussian probability distribution followed by the weight coefficients of words at different positions. At step one, the sliding window aligner computes the first token "a" of the pseudo video "sequence".
  • Figure 4: The cross-modality back-translation procedure. The left sub-figure depicts the Text-Video-Text Back-Translation (T2V2T-BT) procedure, while the right sub-figure shows the Video-Text-Video Back-Translation (V2T2V-BT) procedure. Each sub-figure provides a step-by-step description of the respective back-translation process. The numbers next to the arrows indicate the order of the steps. For instance, "2" marks the second step in the procedure.
  • Figure 5: Case study of USLNet on BOBSL for the sign language generation task. Examples are from the test set.
  • ...and 2 more figures
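
The quantization stage described for Figure 1 can be illustrated with the standard VQ-VAE lookup, in which each embedding is replaced by the index of its nearest codebook vector. The sketch below is a generic nearest-neighbor lookup under that assumption; the function name `quantize` and the toy codebook and embeddings are hypothetical, and the actual model's codebook size and dimensions are not given here.

```python
from typing import List, Sequence


def quantize(embeddings: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each embedding to the index of its nearest codebook entry."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(e, codebook[k]))
        for e in embeddings
    ]


# Hypothetical 2-D codebook with three entries and three video embeddings.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_embeddings = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]
video_tokens = quantize(toy_embeddings, toy_codebook)
```

The resulting integer indices play the role of the discrete video tokens that an autoregressive model can then predict one at a time.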