Giving Meaning to Movements: Challenges and Opportunities in Expanding Communication by Pairing Unaided AAC with Speech Generated Messages

Imran Kabir, Sharon Ann Redmon, Lynn R Elko, Kevin Williams, Mitchell A Case, Dawn J Sowers, Krista Wilkinson, Syed Masum Billah

TL;DR

AllyAAC, a wearable system that pairs a wrist-worn IMU with a smartphone app, was developed through participatory design and evaluated in a field study with 14 participants, producing a dataset of over 600,000 multimodal data points featuring atypical gestures, the first of its kind.

Abstract

Augmentative and Alternative Communication (AAC) technologies are categorized into two forms: aided AAC, which uses external devices like speech-generating systems to produce standardized output, and unaided AAC, which relies on body-based gestures for natural expression but requires shared understanding. We investigate how to combine these approaches to harness the speed and naturalness of unaided AAC while maintaining the intelligibility of aided AAC, a largely unexplored area for individuals with communication and motor impairments. Through 18 months of participatory design with AAC users, we identified key challenges and opportunities and developed AllyAAC, a wearable system with a wrist-worn IMU paired with a smartphone app. We evaluated AllyAAC in a field study with 14 participants and produced a dataset containing over 600,000 multimodal data points featuring atypical gestures--the first of its kind. Our findings reveal challenges in recognizing personalized, idiosyncratic gestures and demonstrate how to address them using Transformer-based large machine learning (ML) models with different pretraining strategies. In sum, we contribute design principles and a reference implementation for adaptive, personalized systems combining aided and unaided AAC.

Paper Structure (75 sections, 11 figures, 2 tables)

Figures (11)

  • Figure 1: (a) The home screen of the AllyAAC app, showing all of the app's available functions. (b) The window for pairing the sensor with the app. Paired sensors appear under "Your Devices," while new sensors appear under "New Devices." (c) The data recording window, along with how the IMU sensor is worn on the wrist. (d) The "Run Model" window, where the user can select a model from the dropdown (e) and test real-time gesture recognition. The interface also provides a toggle button (f), a clutching mechanism for enabling and disabling recognition, and speaks the message associated with each gesture using a text-to-speech (TTS) engine.
  • Figure 2: Screenshots of the app showing the gesture annotation process. (a) The "Annotate Data" window shows the available gesture categories. Users can create a new category by tapping on the "+ NEW GESTURE CATEGORY" button. A communicative message can be assigned to each of the categories. A tap on "forget" will open the window shown in (b), which lists the annotated instances of the "forget" gesture. New annotations of the "forget" gesture can be added by clicking on "+ NEW GESTURE INSTANCE." (c) The annotation window with the selected video and corresponding sensor readings. Users can select a segment that represents the "forget" gesture in the timeline scrubber and save the new instance using the save button at the top-right in (c). (d) A visualization of annotated gestures in the app.
  • Figure 3: Synchronized video frames with accelerometer and gyroscope data showing the movement and inertial measurements of the "forget it" gesture of P03. The top row shows the video frames with accurate timestamps displayed above each frame. The bottom two rows show the accelerometer and gyroscope sensor readings synchronized with the video timestamps.
  • Figure 4: Architecture used for our large gesture recognition model. The model takes a time series of 6-channel IMU signals (accelerometer and gyroscope), applies temporal convolution to extract local motion features, and projects the result into a token sequence. Positional embeddings are added before passing the sequence through the transformer encoder blocks to model temporal dependencies. The transformer outputs are pooled across time using global average pooling, and a final linear layer and softmax map the result to gesture class probabilities.
  • Figure 5: Illustration of the complete model training pipeline. (a) Simplified model architecture for self-supervised pretraining. For three different tasks, three different loss functions are used: Triplet (Schroff et al., 2015) or NT-Xent (Chen et al., 2020), InfoNCE (van den Oord et al., 2018), and MSE. (b) Simplified architecture of the model used in supervised fine-tuning. In this stage, we use cross-entropy loss. (c) Finally, the fine-tuned model is used for real-time inference, where the model outputs predicted probabilities for each gesture category.
  • ...and 6 more figures
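
The data flow described in the Figure 4 caption can be traced with a small sketch. This is not the authors' implementation: it is a plain-Python illustration using untrained random weights, and every name and hyperparameter in it (`predict`, `init_params`, kernel size 8, stride 4, 16-dimensional tokens, two single-head encoder blocks, five gesture classes, fixed sinusoidal positional embeddings) is an assumption chosen only to keep the example short. A real system would use a deep learning framework and trained weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Random (untrained) weights; stand-in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def layer_norm(v, eps=1e-5):
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def temporal_conv(signal, weight, kernel, stride):
    """Slide a window over time; each window of kernel x 6 values becomes one token."""
    tokens = []
    for start in range(0, len(signal) - kernel + 1, stride):
        window = [x for frame in signal[start:start + kernel] for x in frame]
        tokens.append([max(0.0, v) for v in matmul([window], weight)[0]])  # ReLU
    return tokens

def positional_embedding(n_tokens, d_model):
    """Fixed sinusoidal embeddings (assumed here; they could also be learned)."""
    return [[math.sin(p / 10000 ** (i / d_model)) if i % 2 == 0
             else math.cos(p / 10000 ** ((i - 1) / d_model))
             for i in range(d_model)] for p in range(n_tokens)]

def encoder_block(x, p):
    """Single-head self-attention and feed-forward, each with residual + layer norm."""
    d = len(x[0])
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    ctx = matmul(matmul(attn, v), p["wo"])
    x = [layer_norm([a + b for a, b in zip(r, c)]) for r, c in zip(x, ctx)]
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w1"])]
    ff = matmul(hidden, p["w2"])
    return [layer_norm([a + b for a, b in zip(r, f)]) for r, f in zip(x, ff)]

def init_params(rng, kernel=8, d_model=16, d_ff=32, n_layers=2, n_classes=5):
    """All sizes are hypothetical, not taken from the paper."""
    layers = [{"wq": rand_matrix(d_model, d_model, rng),
               "wk": rand_matrix(d_model, d_model, rng),
               "wv": rand_matrix(d_model, d_model, rng),
               "wo": rand_matrix(d_model, d_model, rng),
               "w1": rand_matrix(d_model, d_ff, rng),
               "w2": rand_matrix(d_ff, d_model, rng)} for _ in range(n_layers)]
    return {"conv": rand_matrix(kernel * 6, d_model, rng), "layers": layers,
            "head": rand_matrix(d_model, n_classes, rng)}

def predict(signal, params, kernel=8, stride=4):
    """6-channel IMU window -> gesture class probabilities."""
    tokens = temporal_conv(signal, params["conv"], kernel, stride)
    pos = positional_embedding(len(tokens), len(tokens[0]))
    x = [[t + e for t, e in zip(tok, emb)] for tok, emb in zip(tokens, pos)]
    for layer in params["layers"]:
        x = encoder_block(x, layer)
    pooled = [sum(col) / len(x) for col in zip(*x)]  # global average pooling over time
    return softmax(matmul([pooled], params["head"])[0])

# Usage: a synthetic 100-sample window of accelerometer + gyroscope readings.
rng = random.Random(0)
params = init_params(rng)
window = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(100)]
probs = predict(window, params)
```

With these assumed settings, the 100-sample window yields 24 tokens, and `probs` holds one probability per gesture class. Because the weights are random, the probabilities carry no meaning; the sketch only shows how the shapes move from raw IMU samples to a class distribution.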