Meaningful Pose-Based Sign Language Evaluation
Zifan Jiang, Colin Leong, Amit Moryossef, Anne Göhring, Annette Rios, Oliver Cory, Maksym Ivashechkin, Neha Tarigopula, Biao Zhang, Rico Sennrich, Sarah Ebling
TL;DR
This work addresses how to meaningfully evaluate sign language output represented as human skeletal poses, systematically analyzing distance-based, embedding-based, and back-translation-based metrics. It proposes a unified evaluation framework and an open-source pose-evaluation toolkit that enable reproducible comparisons across sign language generation and translation systems. Through automatic meta-evaluation on sign retrieval and a large-scale human-correlation study across three sign languages, the authors find that carefully tuned distance-based metrics (notably DTW$p$ and nDTW$p$) and back-translation likelihood agree strongly with human judgments, while embedding-based measures excel in sign-level tasks but lag at sentence-level translation. The work emphasizes the need for open pose-to-text models, standardized tooling, and future extensions to phrase-level evaluation, and provides practical guidelines and resources for robust, reusable evaluation in sign language processing.
Abstract
We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.
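To make the keypoint distance-based family concrete, the following is a minimal sketch of a dynamic time warping (DTW) distance between two pose sequences. It is an illustration under stated assumptions, not the pose-evaluation toolkit's implementation: the function names `frame_distance` and `dtw_distance`, the array layout (frames, keypoints, coordinates), and the choice of mean per-keypoint Euclidean distance as the frame-level cost are all hypothetical, and the tuned variants studied in the paper may differ in preprocessing and normalization.

```python
import numpy as np


def frame_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding keypoints of two frames.

    a, b: arrays of shape (keypoints, coordinates). This frame-level cost is
    an assumption made for illustration.
    """
    return float(np.linalg.norm(a - b, axis=-1).mean())


def dtw_distance(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Classic DTW alignment cost between two pose sequences.

    ref: array of shape (T1, keypoints, coordinates)
    hyp: array of shape (T2, keypoints, coordinates)
    Returns the summed frame distance along the cheapest monotonic alignment.
    """
    t1, t2 = len(ref), len(hyp)
    # cost[i, j] = cheapest cost of aligning ref[:i] with hyp[:j]
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = frame_distance(ref[i - 1], hyp[j - 1])
            cost[i, j] = d + min(
                cost[i - 1, j],      # ref advances, hyp frame is reused
                cost[i, j - 1],      # hyp advances, ref frame is reused
                cost[i - 1, j - 1],  # both advance
            )
    return float(cost[t1, t2])


# Toy data: 5 frames, 3 keypoints, 2D coordinates (random, for illustration only).
rng = np.random.default_rng(0)
reference = rng.normal(size=(5, 3, 2))
slowed_down = np.repeat(reference, 2, axis=0)  # same motion at half speed
shifted = reference + np.array([1.0, 0.0])     # same motion, translated

same_motion_cost = dtw_distance(reference, slowed_down)
shifted_cost = dtw_distance(reference, shifted)
```

In this toy setup the half-speed copy aligns at zero cost, since DTW absorbs differences in signing speed, while the translated copy does not. That sensitivity to position is one reason distance-based metrics generally need careful tuning, such as normalizing poses, before scores become meaningful.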
