Towards Privacy-Aware Sign Language Translation at Scale

Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, Jean Maillard

TL;DR

Data scarcity and privacy concerns hinder scalable sign language translation (SLT). The authors propose SSVP-SLT, a privacy-aware two-stage framework that first pretrains a sign language encoder (SignHiera) with masked autoencoding on anonymized, unannotated video and then finetunes for SLT on a curated parallel corpus, optionally adding language-supervised pretraining (LSP). On How2Sign, SSVP-SLT achieves state-of-the-art finetuned and zero-shot gloss-free SLT performance, outperforming the strongest respective baselines by over 3 BLEU-4. The authors also introduce DailyMoth-70h, a new single-signer benchmark. The work shows that facial blurring offers privacy benefits with limited performance loss, highlights the substantial compute required for large-scale video pretraining, and discusses directions for expanding language coverage and privacy techniques in SLT.

Abstract

A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development of SLT technologies should account for. In this work, we propose a two-stage framework for privacy-aware SLT at scale that addresses both of these issues. We introduce SSVP-SLT, which leverages self-supervised video pretraining on anonymized and unannotated videos, followed by supervised SLT finetuning on a curated parallel dataset. SSVP-SLT achieves state-of-the-art finetuned and zero-shot gloss-free SLT performance on the How2Sign dataset, outperforming the strongest respective baselines by over 3 BLEU-4. Based on controlled experiments, we further discuss the advantages and limitations of self-supervised pretraining and anonymization via facial obfuscation for SLT.

Paper Structure (47 sections, 5 figures, 13 tables)

This paper contains 47 sections, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Overview of our two-stage SSVP-SLT method. The first stage consists of training a SignHiera encoder via masked autoencoding (MAE) on blurred video frames. In the second stage, a pretrained T5 model is finetuned for SLT while the pretrained SignHiera is kept frozen. The input video in the second stage can be unblurred.
  • Figure 2: Overview of our language-supervised pretraining (LSP) extension.
  • Figure 3: How2Sign test BLEU of SSVP-SLT after pretraining on either YouTube-ASL and How2Sign or How2Sign only, and then finetuning on the same data.
  • Figure 4: DailyMoth-70h dataset split statistics.
  • Figure 5: Example frames sampled from two videos in the blurred version of the DailyMoth-70h training split.
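
The first stage in Figure 1 trains the encoder by masked autoencoding: most of the video tokens are hidden, and the model is trained to reconstruct them. The sketch below illustrates only that general idea in plain Python; it is not the authors' implementation. The function names (`sample_mae_mask`, `masked_mse`), the flat token indexing, and the 0.9 mask ratio in the usage lines are assumptions made for illustration, and the actual SignHiera masking scheme may differ.

```python
import random


def sample_mae_mask(num_tokens, mask_ratio, seed=None):
    """Randomly split token indices into (visible, masked) lists.

    Hypothetical helper: a real video MAE would mask spatio-temporal
    patches of (blurred) frames rather than flat integer indices.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = round(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked tokens only.

    `pred` and `target` are lists of per-token feature vectors.
    Visible tokens do not contribute to the loss.
    """
    if not masked:
        raise ValueError("at least one token must be masked")
    total = 0.0
    count = 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Usage: hide 90% of 10 tokens (assumed ratio), then score a reconstruction.
visible, masked = sample_mae_mask(num_tokens=10, mask_ratio=0.9, seed=0)
target = [[float(i), float(i)] for i in range(10)]
pred = [[0.0, 0.0] for _ in range(10)]
loss = masked_mse(pred, target, masked)
```

Only the visible tokens would be passed to the encoder, which is what makes pretraining with a high mask ratio cheaper per clip than processing every token.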