Fingerspelling within Sign Language Translation

Garrett Tanzer

TL;DR

This work addresses how well ASL-to-English translation models handle fingerspelling. It introduces a dedicated evaluation protocol (FLEURS-ASL-FS) and tests two concrete interventions: character-level tokenization via ByT5 and cotraining with a fingerspelling recognition dataset (FSboard). The authors annotate 1749 FLEURS-ASL sentences to identify fingerspelled spans, enabling span-based evaluation of translation outputs. Results show that ByT5’s character-level representation substantially improves both overall translation quality and the accuracy of fingerspelled terms within translations, while cotraining with FSboard data has mixed effects. The study advocates adopting character-level tokenization as standard practice in sign language translation and provides a scalable evaluation framework for examining fingerspelling within translation more broadly.

Abstract

Fingerspelling poses challenges for sign language processing due to its high-frequency motion and use for open-vocabulary terms. While prior work has studied fingerspelling recognition, there has been little attention to evaluating how well sign language translation models understand fingerspelling in the context of entire sentences -- and improving this capability. We manually annotate instances of fingerspelling within FLEURS-ASL and use them to evaluate the effect of two simple measures to improve fingerspelling recognition within American Sign Language to English translation: 1) use a model family (ByT5) with character- rather than subword-level tokenization, and 2) mix fingerspelling recognition data into the translation training mixture. We find that 1) substantially improves understanding of fingerspelling (and therefore translation quality overall), but the effect of 2) is mixed.

Paper Structure

This paper contains 15 sections, 2 figures, and 10 tables.

Figures (2)

  • Figure 1: Visual depiction of subword- vs. character-level tokenization as it relates to sign language translation. Color highlights are a visual aid for token boundaries. We omit faces in the figure for privacy. Fingerspelled spans such as "avian influenza" within larger sentences must be mapped to sequences like <avi><an>< influenza> in the T5 subword vocabulary; the model can only know how these tokens are spelled through fingerspelling data coverage (less likely) or from text pretraining (more likely). Character-level tokenization makes the mapping much more straightforward, at the expense of increasing sequence length for the rest of the sentence.
  • Figure 2: Framing for the span extraction task performed by an LLM in our evaluation framework. The LLM sees a) a reference translation, b) the SLT model's predicted translation, and c) a list of fingerspelled terms known to be in the original video. Then it identifies the spans within the predicted translation that best correspond to the fingerspelled terms (or "" for terms where none are found). The role of the fingerspelled terms within the reference sentence often helps to identify the correspondence in the predicted translation when the character-level correspondence is weak. See the paper's appendix for the full prompt, which includes instructions and 3 examples in context.
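
The tokenization contrast described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' code: the helper names are hypothetical, and the id offset of 3 (reserving ids for padding, end-of-sequence, and unknown tokens) is an assumption based on the convention used by ByT5-style byte vocabularies. The subword pieces are the ones quoted in the caption.

```python
# Minimal sketch of byte-level tokenization (hypothetical helpers, not the paper's code).
# Assumption: ids 0-2 are reserved for special tokens, so each UTF-8 byte maps to byte + 3.
BYTE_OFFSET = 3


def byte_tokenize(text: str) -> list[int]:
    """Map text to one id per UTF-8 byte."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize."""
    return bytes(i - BYTE_OFFSET for i in ids).decode("utf-8")


# Subword pieces for the fingerspelled span, as quoted in the Figure 1 caption.
subword_pieces = ["avi", "an", " influenza"]
span = "".join(subword_pieces)

byte_ids = byte_tokenize(span)

# Each fingerspelled letter corresponds to exactly one output id at the byte level,
# whereas the subword model must emit pieces whose spelling it has to memorize.
print(f"subword tokens: {len(subword_pieces)}, byte tokens: {len(byte_ids)}")
```

The byte-level sequence is longer, which is the sequence-length cost the caption mentions, but every fingerspelled letter has a direct one-to-one target in the output.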