A Critical Study of Automatic Evaluation in Sign Language Translation

Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

TL;DR

This paper examines how well automatic metrics evaluate sign language translation (SLT), arguing that text-only measures may fail to capture cross-modal semantics. It systematically compares lexical overlap metrics (BLEU, ROUGE, chrF), an embedding-based metric (BLEURT), and two LLM-based evaluators (G-Eval, GEMBA) under controlled paraphrasing, hallucination, and sentence-length scenarios, using Phoenix-2014T and four SLT models. The findings show that lexical metrics are sensitive to surface-level paraphrasing, whereas BLEURT and the LLM-based evaluators better reflect semantic equivalence but can be biased toward LLM-generated paraphrases and may under-penalize subtle, fluent hallucinations. BLEU, by contrast, is highly sensitive to hallucinations. The study argues for multimodal evaluation frameworks that extend beyond text and encourages combining complementary metrics for a more holistic assessment of SLT outputs, with practical implications for model development.

Abstract

Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
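The sensitivity of lexical overlap metrics to paraphrasing can be illustrated with a minimal sketch. The code below computes clipped n-gram precision, which is one building block of BLEU and not the full metric (no brevity penalty, no averaging over n-gram orders). The function names and the English example sentences are hypothetical and are not taken from the paper or from Phoenix-2014T.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision: a simplified stand-in for BLEU, not BLEU itself."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical sentences, chosen only to illustrate the effect.
reference = "tomorrow it will rain in the north"
verbatim = "tomorrow it will rain in the north"
paraphrase = "rain is expected in the north tomorrow"

scores = {
    "verbatim": clipped_precision(verbatim, reference, n=2),
    "paraphrase": clipped_precision(paraphrase, reference, n=2),
}

for name, score in scores.items():
    print(f"{name}: bigram precision = {score:.3f}")
```

Although the paraphrase carries the same meaning as the reference, only two of its six bigrams match, so its score drops sharply. This is the kind of surface-level penalty that the paraphrasing condition is designed to expose.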

Paper Structure

This paper contains 27 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Pearson correlation between lexical metrics (BLEU, chrF, ROUGE), BLEURT, and GEMBA.
  • Figure 2: Examination of the sensitivity of evaluation metrics, including BLEU, chrF, ROUGE, GEMBA, and BLEURT, to hallucinations in SLT outputs on the Phoenix-2014T test set.
  • Figure 3: Evaluation of four SLT models (SEM-SLT, Signformer, SpaMo, TwoStream-SLT) across sentence length bins (1–6, 7–12, 13–18, 19–24, 25–31) on the Phoenix-2014T test set.
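The two analyses named in the captions of Figures 1 and 3 can be sketched as follows: a Pearson correlation between two lists of metric scores, and the assignment of a sentence to one of the length bins listed for Figure 3. This is a rough illustration, not the authors' code. The helper names are hypothetical, and counting length in whitespace-separated tokens is an assumption, since the source does not state the unit.

```python
import math

# Length bins as listed in the Figure 3 caption.
BINS = [(1, 6), (7, 12), (13, 18), (19, 24), (25, 31)]


def length_bin(sentence):
    """Return the bin label for a sentence, or None if it falls outside all bins.

    Assumption: length is counted in whitespace-separated tokens.
    """
    n = len(sentence.split())
    for low, high in BINS:
        if low <= n <= high:
            return f"{low}-{high}"
    return None


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-sentence scores for two metrics, for illustration only.
metric_a = [0.10, 0.35, 0.50, 0.80]
metric_b = [0.20, 0.30, 0.65, 0.90]
print(f"Pearson r = {pearson(metric_a, metric_b):.3f}")
print(length_bin("morgen regnet es im norden"))
```

Grouping per-sentence scores by `length_bin` and averaging within each group would give one value per bin and model, which is the shape of comparison the Figure 3 caption describes.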