FLEURS-ASL: Including American Sign Language in Massively Multilingual Multitask Evaluation

Garrett Tanzer

TL;DR

FLEURS-ASL extends standard multilingual benchmarks to include ASL as video data, combining high-quality CDI translations with broad multilingual evaluation. The authors propose a unified multitask sign-to-text model that leverages a 34-second signing context and timestamped outputs to support sentence- and discourse-level ASL translation across many languages, evaluated against human baselines and frontier models. Results show the proposed model matches or exceeds caption-level baselines on sentence-level translation and enables additional tasks, while frontier multimodal models exhibit negligible ASL understanding. The work provides a public dataset, rigorous baselines, and a framework to spur development of sign-language evaluation and modeling, emphasizing the necessity of including sign languages in standard evaluation suites.

Abstract

Sign language translation has historically been peripheral to mainstream machine translation research. In order to help converge the fields, we introduce FLEURS-ASL, an extension of the multiway parallel benchmarks FLORES (for text) and FLEURS (for speech) to support their first sign language (as video), American Sign Language, translated by 5 Certified Deaf Interpreters. FLEURS-ASL can be used to evaluate a variety of tasks -- primarily sentence- and discourse-level translation -- between ASL and 200 other languages as text, or 102 languages as speech. We provide baselines for tasks from ASL to English text using a unified modeling approach that incorporates timestamp tokens and previous text tokens in a 34-second context window, trained on random video clips from YouTube-ASL. This model meets or exceeds the performance of phrase-level baselines while supporting a multitude of new tasks. We also use FLEURS-ASL to show that multimodal frontier models have virtually no understanding of ASL, underscoring the importance of including sign languages in standard evaluation suites.

Paper Structure (21 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: FLEURS-ASL dataset splits. The sentences are divided among 5 interpreters, and 3 sets of splits: zero-shot ("zs"), signer-independent finetuning ("si"), and signer-dependent finetuning ("sd"). We blur the interpreters' faces in this paper for privacy, but the underlying dataset is unblurred because facial expressions are an essential component of the grammar of sign languages.
  • Figure 2: Unified multitask document-level sign-to-text training.
    Top left: Model architecture. Input text tokens (token embeddings) and up to 512 frames of half-frame-rate, linearly projected MediaPipe Holistic landmarks (a subset of 85 3D points) are the inputs to T5v1.1-Base (a pretrained encoder-decoder Transformer), finetuned on YouTube-ASL as described below.
    Right: Preprocessing steps to sample training clips from long videos. In order to sample training clips uniformly in proportion to video duration, we perform the following steps. (This is only necessary because SeqIO (Roberts et al., 2022) does not support sampling training examples in proportion to scalar values attached to them, and because we want to support multi-epoch training with random crops without duplicating the underlying data many times.) We chunk arbitrary-length captioned training videos into training examples on disk of $2n$ seconds, preprocess them with MediaPipe Holistic, and carry along the timed caption track, including the previous and next $n$ seconds of captions. (Here, $n$ is 34 seconds so that 15 Hz input fits in 512 tokens.) For each training example, the start position of the clip is sampled from the first half of the chunk uniformly at random, and with probability 0.2 the duration of the clip is truncated to between $m$ and $n$ seconds. (Here, $m$ is 17 seconds.) We modify this scheme slightly at the beginning and end of videos, because by default the first and last $n$ seconds of the video would be extremely undersampled: they do not overlap with other examples. There, we make the chunk duration $\frac{3}{2}n$ seconds rather than $2n$ and sample the start position $\sim \max(0, U[-\frac{n}{2}, n])$ for the first chunk and $\sim \min(\frac{n}{2}, U[0, \frac{3n}{2}])$ for the last chunk. For videos whose duration doesn't divide evenly by $2n$ seconds, we overlap the final two chunks so they are between $\frac{3n}{2}$ and $2n$ seconds long.
    Bottom left: Control token format and mixture. For implementation convenience, we instantiate control tokens as regular text. In order to save compute, task mixture weights are chosen by intuition without hyperparameter sweeps; we expect that optimal weights would change depending on dataset size and application priorities anyway. Captions are represented as text spans delimited by separator tokens (either ' ' or '\n'), optionally with start and end timestamps prepended. There are three sets of captions: the "curr" captions fully contained within the input clip, the "prev" captions starting in the $n$ seconds prior to the input clip, and the "next" captions ending in the $n$ seconds after it. There are two main branches of input tokens: caption alignment (with probability 0.04) and translation (0.96). The caption alignment branch supports two modes: "choose sep" (0.5), where the input specifies caption breaks, and "copy sep" (0.5), where the model predicts them. In the latter case, the average caption duration across the source video can be provided as conditioning (0.5) as a form of controllability. Then the "left edge" caption timestamp is provided, i.e. the end timestamp of any caption that crosses the left boundary of the landmark input, so that the model knows where to start predicting outputs. Finally, there are two more modes: "overflow" (0.8), where the provided captions to align may spill past the right boundary of the clip, and "subset" (0.2), where the captions are a random prefix of those covering the clip. These support whole-video and human-in-the-loop caption alignment respectively. The target tokens are unconditionally timed, newline-separated captions. The translation branch supports untimed (0.2) and timed (0.8) translation (determining whether the target captions include timestamp tokens), with the latter also allowing duration conditioning (0.5). Either no captions (0.2), previous captions (0.64), or previous and next captions (0.16) are provided as context, to support isolated translation, blockwise autoregressive translation, and infilling.
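
The clip sampling described in the Figure 2 caption can be illustrated with a minimal Python sketch. It covers only interior chunks of $2n$ seconds; the modified scheme for the first and last chunks is not modelled. The constants come from the caption, while the function names (`sample_interior_clip`, `frames_for`) and the use of Python's `random` module are assumptions for illustration and are not taken from the authors' pipeline.

```python
import random

# Values quoted in the Figure 2 caption.
N_SECONDS = 34       # n: maximum clip duration in seconds
M_SECONDS = 17       # m: minimum duration when a clip is truncated
INPUT_HZ = 15        # half-frame-rate landmark input
MAX_FRAMES = 512     # frame budget of the model input
TRUNCATE_PROB = 0.2  # probability of truncating the clip duration


def sample_interior_clip(chunk_start, rng):
    """Return (clip_start, clip_duration) in seconds for an interior 2n-second chunk.

    Hypothetical helper: a sketch of the sampling rule, not the authors' code.
    """
    # The start position is uniform over the first half of the chunk.
    offset = rng.uniform(0.0, N_SECONDS)
    # With probability 0.2, the duration is truncated to between m and n seconds.
    if rng.random() < TRUNCATE_PROB:
        duration = rng.uniform(M_SECONDS, N_SECONDS)
    else:
        duration = float(N_SECONDS)
    return chunk_start + offset, duration


def frames_for(duration):
    """Number of landmark frames covered by a clip at the 15 Hz input rate."""
    return int(duration * INPUT_HZ)


if __name__ == "__main__":
    rng = random.Random(0)
    start, duration = sample_interior_clip(chunk_start=0.0, rng=rng)
    print(start, duration, frames_for(duration))
```

A full-length clip covers 34 × 15 = 510 frames, which is why $n = 34$ seconds fits within the 512-frame budget. Because the start is drawn from the first half of the chunk and the duration never exceeds $n$, a sampled clip always ends inside its own $2n$-second chunk.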
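
The task mixture in the Figure 2 caption amounts to a small tree of random choices, which the following sketch samples with the quoted probabilities. The dictionary keys and the function name `sample_task` are hypothetical. The sketch also assumes that the context choice (no captions, previous, or previous and next) applies to the translation branch only, which the caption does not state explicitly.

```python
import random


def sample_task(rng):
    """Draw one task configuration using the probabilities quoted in the caption.

    Hypothetical helper: key names are illustrative, not the authors' control tokens.
    """
    if rng.random() < 0.04:
        # Caption alignment branch (probability 0.04).
        task = {"branch": "caption_alignment"}
        task["sep_mode"] = "choose_sep" if rng.random() < 0.5 else "copy_sep"
        # Duration conditioning is offered only in the latter ("copy sep") mode.
        task["duration_conditioning"] = (
            task["sep_mode"] == "copy_sep" and rng.random() < 0.5
        )
        task["caption_set"] = "overflow" if rng.random() < 0.8 else "subset"
        return task

    # Translation branch (probability 0.96).
    task = {"branch": "translation"}
    task["timed"] = rng.random() < 0.8
    # Duration conditioning is only allowed for timed translation.
    task["duration_conditioning"] = task["timed"] and rng.random() < 0.5
    # Assumption: the context mixture applies to the translation branch.
    task["context"] = rng.choices(
        ["none", "prev", "prev_and_next"], weights=[0.2, 0.64, 0.16]
    )[0]
    return task


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_task(rng))
```

Under these assumptions, the marginal probability of a timed translation example with duration conditioning is 0.96 × 0.8 × 0.5 = 0.384.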