
FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones

Manfred Georg, Garrett Tanzer, Saad Hassan, Maximus Shengelia, Esha Uboweja, Sam Sepah, Sean Forbes, Thad Starner

TL;DR

FSboard introduces the largest ASL fingerspelling dataset to date, collected from 147 Deaf signers using mobile selfie cameras to support a mobile text-entry use case. The dataset comprises over 3.2 million characters and 266 hours of video, enabling long, one-handed fingerspelling sequences. A pretrained ByT5-based baseline with 30 Hz MediaPipe Holistic inputs achieves 11.1% CER on a unique-phrase test set, with ablations showing graceful degradation under lower frame rates and partial landmark removal, suggesting potential for on-device use. This work emphasizes community involvement and provides a substantial resource to accelerate practical, equitable sign-language technology development and future end-to-end sign language understanding research.

Abstract

Progress in machine understanding of sign languages has been slow and hampered by limited data. In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. Fingerspelling recognition is an incomplete solution that is only one small part of sign language translation, but it could provide some immediate benefit to Deaf/Hard of Hearing signers as more broadly capable technology develops. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x. As a simple baseline, we finetune ByT5-Small on 30 Hz MediaPipe Holistic landmark inputs and achieve 11.1% Character Error Rate (CER) on a test set with unique phrases and signers. This quality degrades gracefully when decreasing frame rate and excluding face/body landmarks: plausible optimizations to help models run on device in real time.
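
For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the character-level Levenshtein (edit) distance between predicted and reference phrases, divided by the number of reference characters. The function names `edit_distance` and `cer` are hypothetical, and pooling edits over the whole test set (rather than averaging per phrase) is an assumption; the abstract does not state which normalization the authors use.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER (assumed pooling): total edits / total reference chars."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars
```

Under this definition, a CER of 11.1% corresponds to roughly one character edit for every nine reference characters.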

Paper Structure (12 sections, 7 figures)

Figures (7)

  • Figure 1: A sample of frames from FSboard. Faces blurred here but not in the dataset.
  • Figure 2: A sample of phrases from each category of FSboard. Addresses, phone numbers, and names are generated randomly; they are not real personally identifiable information (PII).
  • Figure 3: Monk Skin Tone Scale ratings for FSboard participants, annotated by majority vote of three human raters trained specifically for the skin tone task.
  • Figure 4: Perceived gender presentation of FSboard participants, annotated by human raters. Note that this is not equivalent to gender identity, because it is predicted from the videos rather than self-identified.
  • Figure 5: Summary statistics for fingerspelling recognition datasets.
  • ...and 2 more figures