Towards an AI-Driven Video-Based American Sign Language Dictionary: Exploring Design and Usage Experience with Learners

Saad Hassan, Matyas Bohacek, Chaelin Kim, Denise Crochet

TL;DR

This work tackles the difficulty of looking up ASL signs without text queries by delivering a fully automated, privacy-preserving video-based dictionary built on state-of-the-art sign recognition. Leveraging a Transformer-based SPOTER architecture, the authors provide a feedback-rich UI that displays confidence levels and latency information, enabling learners to refine their video submissions. An observational study with 12 novice ASL learners using real tasks reveals benefits in search ease, exposure to signing variation, and comprehension, while also uncovering challenges related to recording unknown signs, output unpredictability, latency, and privacy concerns. The results yield practical design guidance for deployment, including real-time submission feedback, confidence-based result ranking, and strategies to reduce latency and background noise, with the prototype openly released for research use. Collectively, the paper advances video-based ASL dictionary design by integrating prior WoZ insights with functional AI feedback in authentic learning contexts, highlighting both educational value and non-functional considerations like bias and privacy.

Abstract

Searching for unfamiliar American Sign Language (ASL) signs is challenging for learners because, unlike with spoken languages, they cannot type a text-based query to look up an unfamiliar sign. Advances in isolated sign recognition have enabled the creation of video-based dictionaries, allowing users to submit a video and receive a list of the closest matching signs. Previous HCI research using Wizard-of-Oz prototypes has explored interface designs for ASL dictionaries. Building on these studies, we incorporate their design recommendations and leverage state-of-the-art sign-recognition technology to develop an automated video-based dictionary. We also present findings from an observational study with twelve novice ASL learners who used this dictionary during video-comprehension and question-answering tasks. Our results address human-AI interaction challenges not covered in previous WoZ research, including recording and resubmitting signs, unpredictable outputs, system latency, and privacy concerns. These insights offer guidance for designing and deploying video-based ASL dictionary systems.

Paper Structure

This paper contains 50 sections, 2 equations, 10 figures.

Figures (10)

  • Figure 1: Performance of our sign language recognition model for signs grouped by movement (a), number of hands (b), and location (c). For each feature, top-1 and top-7 accuracies on ASL Citizen are reported. ASL Citizen top-1 accuracy is shown in vivid green $\blacksquare$, while top-7 appears in vivid green with half opacity $\blacksquare$.
  • Figure 2: The top-1 and top-7 testing accuracy of our sign language recognition model (y-axis) for input videos of increasing resolution ratio (x-axis) on ASL Citizen. Vivid green $\blacksquare$ represents top-1 accuracy; vivid red $\blacksquare$ represents top-7. The resolution ratio indicates video resolution relative to the standard $640\times480$ pixels.
  • Figure 3: Latency analysis of the custom sign recognition AI model, showing the prediction time as a function of input video length. A straight line is fitted to the data.
  • Figure 4: The 'Detailed analysis' page presents the model's top $20$ predictions in a scrollable grid of sign entries, ordered by likelihood. Each entry includes a representative recording, word translation, probability score, and metadata.
  • Figure 5: The webcam recording and file upload interface includes built-in tools for video clipping and editing.
  • ...and 5 more figures
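
The captions for Figures 1 and 2 report top-1 and top-7 accuracy. As a minimal sketch of how such a top-k metric is typically computed (not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores below are hypothetical), a prediction counts as correct when the true sign is among the k highest-scoring classes:

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (hypothetical): 4 videos, 3 candidate signs each.
scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked third
    [0.6, 0.1, 0.3],  # true class 0: ranked first
]
labels = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

With a larger k, more near-miss predictions count as hits, which is why a dictionary that shows a ranked list of candidates (such as the top-20 grid in Figure 4) is usually evaluated with top-k rather than top-1 alone.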