Saudi Sign Language Translation Using T5

Ali Alhejab, Tomas Zelezny, Lamya Alkanhal, Ivan Gruber, Yazeed Alharbi, Jakub Straka, Vaclav Javorek, Marek Hruz, Badriah Alkalifah, Ahmed Ali

TL;DR

This work tackles Saudi Sign Language (SSL) translation under data scarcity by applying T5-based models to a novel SSL dataset with three testing protocols. It introduces a pose-based video preprocessing pipeline and demonstrates that pre-training on a large ASL dataset (YouTubeASL) substantially improves SSL translation, yielding a roughly $3\times$ gain in BLEU-4. The study compares multiple model variants (T5-base, T5v1.1-base, mT5-base) and languages (English/Arabic), showing that cross-lingual transfer from ASL to SSL improves generalization to unseen signers and sentences. SSL-specific challenges such as face occlusion and gender imbalance are discussed, and the results support cross-lingual pre-training as a viable path for low-resource sign languages. The work provides a reproducible pipeline, extensive experiments, and publicly available code to advance SSL translation systems.

Abstract

This paper explores the application of T5 models for Saudi Sign Language (SSL) translation using a novel dataset. The SSL dataset includes three challenging testing protocols, enabling comprehensive evaluation across different scenarios. Additionally, it captures unique SSL characteristics, such as face coverings, which pose challenges for sign recognition and translation. In our experiments, we investigate the impact of pre-training on American Sign Language (ASL) data by comparing T5 models pre-trained on the YouTubeASL dataset with models trained directly on the SSL dataset. Experimental results demonstrate that pre-training on YouTubeASL significantly improves models' performance (roughly $3\times$ in BLEU-4), indicating cross-linguistic transferability in sign language models. Our findings highlight the benefits of leveraging large-scale ASL data to improve SSL translation and provide insights into the development of more effective sign language translation systems. Our code is publicly available at our GitHub repository.

Paper Structure

This paper contains 19 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Histogram of Signing Duration and Word Length.
  • Figure 2: We use only a subset of the keypoints extracted by MediaPipe. (a) shows all keypoints extracted by the individual MediaPipe models for the body, face, and hands. (b) shows the subset of keypoints that are used as input to our model.
  • Figure 3: Video preprocessing based on sign space. (a) illustration of sign space in input frame, (b) cropped and padded frame.
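
The crop-and-pad step named in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the sign space is a square box around the selected keypoints, enlarged by a relative margin, and that regions falling outside the frame are zero-padded. The function name `crop_and_pad`, the `margin` parameter, and the square-box definition are all hypothetical choices made for illustration.

```python
import numpy as np


def crop_and_pad(frame, keypoints, margin=0.1):
    """Crop `frame` to a square box around `keypoints`, zero-padding where
    the box extends past the frame.

    ASSUMPTION: the "sign space" is approximated here by the keypoints'
    bounding box, made square and enlarged by `margin` on every side.
    The paper's actual definition may differ.

    frame:     (H, W) or (H, W, C) image array
    keypoints: (N, 2) array of (x, y) pixel coordinates
    """
    h, w = frame.shape[:2]
    xs, ys = keypoints[:, 0], keypoints[:, 1]

    # Centre and side length of the square box.
    cx = (xs.min() + xs.max()) / 2
    cy = (ys.min() + ys.max()) / 2
    extent = max(xs.max() - xs.min(), ys.max() - ys.min())
    side = max(int(round(extent * (1 + 2 * margin))), 1)

    # Box corners in frame coordinates (may be negative or exceed the frame).
    x0 = int(round(cx - side / 2))
    y0 = int(round(cy - side / 2))
    x1, y1 = x0 + side, y0 + side

    # Output starts as zeros, so uncovered regions stay padded.
    out = np.zeros((side, side) + frame.shape[2:], dtype=frame.dtype)

    # Copy the part of the box that overlaps the frame.
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x1, w), min(y1, h)
    if sx1 > sx0 and sy1 > sy0:
        out[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = frame[sy0:sy1, sx0:sx1]
    return out
```

Under these assumptions every frame comes out square regardless of where the signer stands in the original video, which is the property the caption's "cropped and padded frame" suggests; the padding value and margin would need to match whatever the actual pipeline uses.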