Meaningful Pose-Based Sign Language Evaluation

Zifan Jiang, Colin Leong, Amit Moryossef, Anne Göhring, Annette Rios, Oliver Cory, Maksym Ivashechkin, Neha Tarigopula, Biao Zhang, Rico Sennrich, Sarah Ebling

TL;DR

This work addresses the challenge of meaningfully evaluating sign language output represented as human skeletal poses by systematically analyzing distance-based, embedding-based, and back-translation-based metrics. It proposes a unified evaluation framework and an open-source pose-evaluation toolkit, enabling reproducible comparisons across sign-language generation and translation systems. Through automatic meta-evaluation on sign retrieval and a large-scale human-correlation study across three sign languages, the authors find that carefully tuned distance-based metrics (notably DTW$_p$ and nDTW$_p$) and back-translation likelihood show strong agreement with human judgments, while embedding-based measures excel in sign-level tasks but lag at sentence-level translation. The work emphasizes the need for open pose-to-text models, standardized tooling, and future extensions to phrase-level evaluation, ultimately providing practical guidelines and resources to advance robust, reusable evaluation in sign language processing.

Abstract

We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.

Paper Structure

This paper contains 44 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Pose-based evaluation taxonomy overview. We compare a reference and a hypothesis pose sequence in one of three ways: (a) computing distance-based metrics directly on the keypoint sequences, optionally aligned by dynamic time warping (DTW); (b) encoding each sequence into a shared embedding space and measuring similarity; and (c) back-translating the hypothesis poses into text and applying conventional machine translation metrics to that text.
  • Figure 2: MediaPipe keypoint selection strategies.
  • Figure 3: Sequence alignment (in green) between a shorter sequence (in red) and a longer sequence (in blue). In practice, pose keypoint trajectories are aligned temporally in 3D and then averaged over the whole body. Padding frames either copy the first frame or are filled with zeros.
  • Figure 4: A screenshot of an example text-to-pose evaluation task in Appraise featuring sentence-level source-based direct assessment with custom annotator guidelines in German/French/Italian and DSGS/LSF/LIS, translated into English for readers' convenience.
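
The distance-based route in Figures 1 and 3 can be illustrated with a minimal sketch. It is not the toolkit's implementation: the function names, the choice to append padding at the end of the shorter sequence, and the use of Euclidean distance per keypoint are all assumptions made for illustration. Each pose sequence is a non-empty list of frames, and each frame is a list of (x, y, z) keypoints.

```python
import math


def pad_sequence(seq, length, mode="zero"):
    """Pad a list of frames to `length` frames (hypothetical helper).

    Assumption: padding is appended at the end. The filler frame is either
    a copy of the first frame or an all-zero frame.
    """
    missing = length - len(seq)
    if missing <= 0:
        return list(seq)
    if mode == "first_frame":
        filler = seq[0]
    else:
        filler = [(0.0, 0.0, 0.0)] * len(seq[0])
    return list(seq) + [filler] * missing


def padded_distance(ref, hyp, mode="zero"):
    """Mean per-keypoint Euclidean distance after padding to equal length."""
    length = max(len(ref), len(hyp))
    ref_p = pad_sequence(ref, length, mode)
    hyp_p = pad_sequence(hyp, length, mode)
    total, count = 0.0, 0
    for ref_frame, hyp_frame in zip(ref_p, hyp_p):
        for ref_kp, hyp_kp in zip(ref_frame, hyp_frame):
            total += math.dist(ref_kp, hyp_kp)
            count += 1
    return total / count


def dtw_trajectory(a, b):
    """Classic DTW cost between two trajectories of 3D points."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def dtw_pose_distance(ref, hyp):
    """Align each keypoint trajectory with DTW, then average over keypoints."""
    num_keypoints = len(ref[0])
    per_keypoint = [
        dtw_trajectory([frame[k] for frame in ref], [frame[k] for frame in hyp])
        for k in range(num_keypoints)
    ]
    return sum(per_keypoint) / num_keypoints


# Toy example: one keypoint moving along x; the hypothesis holds one frame longer.
ref = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
hyp = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
print(dtw_pose_distance(ref, hyp))      # DTW absorbs the repeated frame
print(padded_distance(ref, hyp, "zero"))  # padding penalises the length mismatch
```

In this toy case the two sequences trace the same motion at different speeds, so the DTW variant treats them as identical while the padded variant does not. That contrast is the tradeoff between the two alignment strategies that the captions describe.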