Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field

Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, Richard Bowden

Abstract

Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available at https://github.com/ozgemercanoglu/sltbaselines to support transparency and reproducibility in SLT research.
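The abstract's point about differences in the calculation of evaluation metrics can be made concrete with a small sketch. The code below is not taken from the paper or its codebase: it is a minimal, unsmoothed, single-reference corpus BLEU-4 written for illustration, and the function names (`corpus_bleu4`, `whitespace_tokens`, `normalized_tokens`) and the two example sentences are hypothetical. It scores the same hypotheses twice, changing only how the text is tokenized before n-grams are counted.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references, tokenize):
    """Unsmoothed corpus-level BLEU-4 with one reference per hypothesis."""
    matches = [0] * 4
    totals = [0] * 4
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = tokenize(hyp), tokenize(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, 5):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: a hypothesis n-gram counts at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def whitespace_tokens(text):
    """Split on whitespace only; case and attached punctuation are kept."""
    return text.split()


def normalized_tokens(text):
    """Lowercase and separate punctuation marks from words."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Hypothetical model outputs and references (not from any SLT dataset).
hyps = ["Tomorrow it will rain in the north.", "the wind is weak today"]
refs = ["tomorrow it will rain in the north .", "the wind is weak today ."]

raw_score = corpus_bleu4(hyps, refs, whitespace_tokens)
normalized_score = corpus_bleu4(hyps, refs, normalized_tokens)
print(f"whitespace tokenization: {raw_score:.2f}")
print(f"normalized tokenization: {normalized_score:.2f}")
```

The two printed scores differ even though the model outputs are identical, which is why a comparison across papers is only meaningful when the metric implementation, tokenization, and casing are fixed for all methods.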

Paper Structure (18 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Reported vs reproduced results.
  • Figure 2: Comparison of sign language translation datasets in terms of gloss/vocabulary size, BLEU-4 score, dataset duration, and citation-based popularity. The x-axis represents the gloss size for datasets with gloss annotations (shown with solid borders), or vocabulary size from spoken language sentences for datasets without gloss annotations (shown with dashed borders). The y-axis shows the maximum BLEU-4 translation performance as reported in the literature. Circle sizes reflect the total number of hours. Color opacity encodes dataset popularity based on citation counts, with darker circles indicating more widely used datasets.
  • Figure 3: Architectural overview and training objectives of the compared sign language translation methods. The right side summarizes which objectives are used by each method.
  • Figure 5: Validation loss and BLEU-4 score curves for the GFSLT-VLP baseline [zhou2023gloss]. For CSL-Daily, validation loss remained stable or began to rise early in training when using the same learning rate as for Phoenix-2014T, suggesting that the learning rate was too high.
  • Figure 6: BLEU-4 progression on the Phoenix-2014T dev set for FLa-LLM [chen2024factorized] and C2RL [chen2024c] during training. Each model is first pretrained and then fine-tuned starting from the best pretrained checkpoint. Fine-tuning improves rapidly at first but then saturates, which accounts for the gap between our reproduced results and those reported in the original papers.