Reconsidering Sentence-Level Sign Language Translation

Garrett Tanzer, Maximus Shengelia, Ken Harrenstien, David Uthus

TL;DR

This paper argues that translating sign languages at the sentence level neglects discourse-level dependencies inherent to the visual-spatial modality. It surveys long-range linguistic phenomena and introduces the first human baseline for ASL→English translation on How2Sign, revealing that about one-third of instances require additional discourse context for proper understanding. The findings show only modest gains from adding context to sentence-level translations and reveal substantial inter-interpreter variation and alignment challenges. Collectively, the work advocates rethinking task framing and benchmarks to better capture sign-language discourse and to mitigate biases in dataset design.

Abstract

Historically, sign language machine translation has been posed as a sentence-level task: datasets consisting of continuous narratives are chopped up and presented to the model as isolated clips. In this work, we explore the limitations of this task framing. First, we survey a number of linguistic phenomena in sign languages that depend on discourse-level context. Then, as a case study, we perform the first human baseline for sign language translation that actually substitutes a human into the machine learning task framing, rather than providing the human with the entire document as context. This human baseline -- for ASL to English translation on the How2Sign dataset -- shows that for 33% of sentences in our sample, our fluent Deaf signer annotators were only able to understand key parts of the clip in light of additional discourse-level context. These results underscore the importance of understanding and sanity-checking examples when adapting machine learning to new domains.

Paper Structure (17 sections, 2 figures, 3 tables)

This paper contains 17 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Example of the interaction between classifiers and long-range context. It isn't clear in isolation that the fist moving back and forth represents a fist controlling a joystick, or that the arm represents the plane's wing and the hand represents a flap (aileron) on the wing. Interpreter's head omitted here for privacy.
  • Figure 2: Example of the interaction between rapid fingerspelling and long-range context. Top is the first "basil" in the narrative (itself spelled slightly out of order), and bottom is the version from the test sentence: highly coarticulated, with multiple letters produced simultaneously. The labels indicate the relevant letters given the ground truth, but without context other letters such as Y, X, and T could be perceived.