Cross-modality Data Augmentation for End-to-End Sign Language Translation

Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Hui Xiong

TL;DR

The paper tackles end-to-end sign language translation by addressing the modality gap between sign videos and spoken text and by mitigating data scarcity. It introduces Cross-modality Data Augmentation (XmDA), which combines cross-modality mix-up (bridging sign video features and gloss embeddings) with cross-modality knowledge distillation (soft-guided targets from multiple gloss-to-text teachers). Evaluations on PHOENIX-2014T and CSL-Daily show consistent improvements over baselines in BLEU, ROUGE, and ChrF, with notable gains in handling low-frequency words and long sentences. XmDA offers a resource-efficient means to boost video-to-text SLT without additional data, by effectively transferring gloss-to-text translation strengths to end-to-end SLT.

Abstract

End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly, without intermediate representations. The task is challenging because of the modality gap between sign videos and texts and the scarcity of labeled data. As a result, the input representations and output predictions of end-to-end sign language translation (i.e., video-to-text) are weaker than those of the gloss-to-text approach (i.e., text-to-text). To tackle these challenges, we propose a novel Cross-modality Data Augmentation (XmDA) framework that transfers the powerful gloss-to-text translation capabilities to end-to-end sign language translation (i.e., video-to-text) by exploiting pseudo gloss-text pairs from the sign gloss translation model. Specifically, XmDA consists of two key components: cross-modality mix-up and cross-modality knowledge distillation. The former explicitly encourages alignment between sign video features and gloss embeddings to bridge the modality gap. The latter uses the generation knowledge of gloss-to-text teacher models to guide spoken language text generation. Experimental results on two widely used SLT datasets, PHOENIX-2014T and CSL-Daily, demonstrate that the proposed XmDA framework significantly and consistently outperforms the baseline models. Extensive analyses confirm our claim that XmDA enhances spoken language text generation by reducing the representation distance between videos and texts and by improving the processing of low-frequency words and long sentences.
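
The knowledge distillation component described above can be pictured as a data-expansion step. The following is a minimal sketch, assuming sequence-level distillation in which each gloss-to-text teacher translates the gloss annotation of a training sample and the result is paired with the same video as an extra target. The function `build_augmented_pairs`, the sample layout `(video_id, gloss, reference)`, and the two toy teachers are hypothetical names introduced for illustration; they are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A teacher maps a gloss sequence (as a string) to a spoken-language sentence.
Teacher = Callable[[str], str]
# One training sample: (video identifier, gloss annotation, reference text).
Sample = Tuple[str, str, str]


def build_augmented_pairs(
    samples: Sequence[Sample], teachers: Sequence[Teacher]
) -> List[Tuple[str, str]]:
    """Pair each video with its reference text plus one pseudo target per teacher.

    Pseudo targets that duplicate an existing target for the same video are dropped.
    """
    augmented: List[Tuple[str, str]] = []
    for video_id, gloss, reference in samples:
        targets = [reference]
        for teacher in teachers:
            pseudo = teacher(gloss)
            if pseudo not in targets:
                targets.append(pseudo)
        augmented.extend((video_id, target) for target in targets)
    return augmented


# Toy stand-ins for trained gloss-to-text models (assumptions, for illustration only).
def toy_teacher_lower(gloss: str) -> str:
    return gloss.lower()


def toy_teacher_reversed(gloss: str) -> str:
    return " ".join(reversed(gloss.lower().split()))
```

In this sketch the video-to-text student would then be trained on the enlarged list of `(video_id, target)` pairs; how the actual system weights or filters teacher outputs is not specified on this page.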

Paper Structure (37 sections, 8 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: Illustration showing that the gloss-to-text (i.e., text-to-text) model has more distinct embeddings and better output predictions than the sign language translation (i.e., video-to-text) model.
  • Figure 2: The overall framework of cross-modality data augmentation methods for SLT in this work. Components in gray indicate frozen parameters.
  • Figure 3: Bivariate kernel density estimation visualization of sentence-level representations: sign embeddings from baseline SLT, gloss embeddings from the gloss-to-text teacher model, and mixed-modal representations obtained by mixing sign embeddings and gloss embeddings with $\lambda=0.6$. Best viewed in color.
  • Figure 4: Visualization of gloss and sign representation distributions for the Baseline SLT (in blue) and "+ Cross-modality Mix-up" (in green) models by t-SNE. Best viewed in color.
  • Figure 5: BLEU score of "+ Cross-modality Mix-up" on PHOENIX-2014T dev set, with different mix-up ratio $\lambda$. When $\lambda=0.0$, "+ Cross-modality Mix-up" degrades to the baseline model.
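
The role of the mix-up ratio $\lambda$ in Figures 3 and 5 can be illustrated with a small sketch. It assumes a substitution-style mix-up: each gloss is aligned to a span of sign-feature frames, and with probability $\lambda$ that span is replaced by the gloss embedding. The alignment `spans`, the function name `cross_modality_mixup`, and the substitution rule itself are assumptions made for illustration; this page does not spell out the exact mixing procedure.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def cross_modality_mixup(
    sign_feats: Sequence[Vector],
    gloss_embs: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
    lam: float,
    rng: random.Random,
) -> List[Vector]:
    """Build a mixed-modal sequence from sign features and gloss embeddings.

    spans[i] = (start, end) is the half-open frame range aligned to gloss i
    (a hypothetical alignment, e.g. from a forced aligner). With probability
    `lam` the span is replaced by the single gloss embedding; otherwise the
    original sign-feature frames are kept.
    """
    if len(gloss_embs) != len(spans):
        raise ValueError("need exactly one span per gloss embedding")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    mixed: List[Vector] = []
    for emb, (start, end) in zip(gloss_embs, spans):
        # rng.random() is in [0, 1), so lam = 0 never substitutes
        # and lam = 1 always substitutes.
        if rng.random() < lam:
            mixed.append(list(emb))
        else:
            mixed.extend(list(frame) for frame in sign_feats[start:end])
    return mixed
```

Under these assumptions, `lam=0.0` returns the sign features unchanged, which matches the caption's remark that the method then degrades to the baseline, while `lam=1.0` yields a sequence made only of gloss embeddings.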