Generating Signed Language Instructions in Large-Scale Dialogue Systems

Mert İnan, Katherine Atwell, Anthony Sicilia, Lorna Quandt, Malihe Alikhani

Abstract

We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at https://github.com/Merterm/signed-dialogue, and a demo of our signed instruction video retrieval system is available at https://huggingface.co/spaces/merterm/signed-instructions.
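
As a rough illustration of the token-based video retrieval mentioned above, the following minimal sketch looks up one sign video per gloss token. The index contents, file paths, and the function name `retrieve_sign_videos` are hypothetical and are not taken from the released codebase; the sketch also assumes glosses are whitespace-separated uppercase tokens.

```python
from typing import Dict, List, Tuple

# Hypothetical gloss-to-video index; tokens and paths are illustrative only.
VIDEO_INDEX: Dict[str, str] = {
    "FOLD": "videos/fold.mp4",
    "PAPER": "videos/paper.mp4",
    "HALF": "videos/half.mp4",
}


def retrieve_sign_videos(
    gloss: str, index: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (video paths in gloss order, gloss tokens with no video)."""
    found: List[str] = []
    missing: List[str] = []
    for token in gloss.upper().split():
        if token in index:
            found.append(index[token])
        else:
            missing.append(token)
    return found, missing


videos, missing = retrieve_sign_videos("PAPER FOLD HALF CREASE", VIDEO_INDEX)
print(videos)   # paths for PAPER, FOLD, HALF in that order
print(missing)  # tokens without a stored video
```

Returning the unmatched tokens separately keeps the choice of fallback open to the caller; how the actual system handles tokens without a video is not stated in this abstract.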

Paper Structure

This paper contains 24 sections, 2 equations, 9 figures, 1 table, 2 algorithms.

Figures (9)

  • Figure 1: An overview of our multimodal dialogue system, capable of giving signed instructions to Deaf or Hard-of-Hearing users in ASL. We first translate task instructions to an intermediate textual representation called glosses using Large Language Models; then, we fetch token-level sign videos to display on the screens of Amazon Alexa Echo Show.
  • Figure 2: A storyboard of all the screens for an origami task with ASL video instructions. The first screen from the top is the landing page with an ASL Task button to enter the signed section. The second screen shows different recipes and task options. The following screens show an instruction step. Button interactions are especially important for signers as the audio is inaccessible.
  • Figure 3: The overall architecture of our dialogue system with sign instructions for American Sign Language. Offline LLM translations make it easier to plug in a signing module into a traditional dialogue architecture.
  • Figure 4: Hit Rate and Recall@1 of our signed instruction retrieval algorithm as the available video set grows. The two lines correspond to two methods of translating text to gloss. For a constrained setup with limited sign video storage, the plots show how many videos are needed under each translation strategy. Overall, LLMs produce more diverse translations, while rule-based heuristics provide more accurate translations, depending on the video dataset size.
  • Figure 5: The screens for an alternative task, a classic blondies recipe. The main difference for recipes is that each step shows the relevant ingredients in addition to the signed instruction video, which reduces the cognitive load on the user. The first panel also shows the ASL button that appears in supported recipes.
  • ...and 4 more figures
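
The Figure 4 caption describes how retrieval quality changes as the stored video set grows. The sketch below shows one way such a curve could be computed. It assumes that a hit means a gloss token has a matching video in the available set; the paper's exact definitions of Hit Rate and Recall@1 are not given in this excerpt, and the function names and toy data are hypothetical.

```python
from typing import List, Set


def hit_rate(gloss_tokens: List[str], available: Set[str]) -> float:
    """Fraction of gloss tokens that have a video in the available set."""
    if not gloss_tokens:
        return 0.0
    hits = sum(1 for token in gloss_tokens if token in available)
    return hits / len(gloss_tokens)


def hit_rate_curve(gloss_tokens: List[str], ranked_vocab: List[str]) -> List[float]:
    """Hit rate when only the first k videos of ranked_vocab are stored, for k = 1..n."""
    return [
        hit_rate(gloss_tokens, set(ranked_vocab[:k]))
        for k in range(1, len(ranked_vocab) + 1)
    ]


# Toy data: gloss tokens from a translated instruction and a ranked video vocabulary.
tokens = ["PAPER", "FOLD", "HALF", "FOLD"]
vocab = ["FOLD", "PAPER", "CUT"]
curve = hit_rate_curve(tokens, vocab)
print(curve)  # [0.5, 0.75, 0.75]
```

Running the same function on glosses from two translation strategies against the same ranked vocabulary would give two curves comparable to the two lines described in the caption.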