ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation

Boyin Yang, Puming Jiang, Per Ola Kristensson

TL;DR

ImageTalk addresses the challenge of low text-entry rates in AAC for people with motor neuron disease by fusing image recognition with large-language-model–driven text generation to produce richer, controllable narratives with substantial keystroke savings. The authors validate a triple-diamond design process involving proxy-users and end users, achieving up to 95.6% keystroke savings and high user satisfaction, and they distill three design guidelines plus four levels of acceptance for AI-generated content. The work demonstrates how multimodal cues from images, combined with prompts and steering, can enhance the quality and practicality of AAC storytelling. Open-source release of ImageTalk is proposed to accelerate further research and development in AI-assisted AAC.

Abstract

People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
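To make the keystroke-savings figure concrete, the sketch below uses a definition that is common in the AAC literature: the share of keystrokes avoided relative to typing every character of the final text. This excerpt does not show the paper's own equation, so the formula, the function name `keystroke_savings`, and the example numbers are all assumptions chosen for illustration.

```python
def keystroke_savings(keystrokes_used: int, output_chars: int) -> float:
    """Return the percentage of keystrokes saved.

    Assumed definition (not taken from the paper):
        KS = (1 - keystrokes_used / output_chars) * 100
    where output_chars is the length of the final generated text.
    """
    if output_chars <= 0:
        raise ValueError("output_chars must be positive")
    return (1.0 - keystrokes_used / output_chars) * 100.0


if __name__ == "__main__":
    # Hypothetical numbers, picked only to illustrate the arithmetic:
    # 22 keystrokes entered to obtain a 500-character narrative.
    print(round(keystroke_savings(22, 500), 1))  # 95.6
```

Under this definition, a savings rate in the mid-90s means the user types only a few keywords while the system generates almost all of the remaining text.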

Paper Structure

This paper contains 45 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: The function structure model for the ImageTalk system. The fonts indicate different element types. Bold text along with the rectangle indicates the functions and sub-functions. Italic text aligned with dashed lines represents the system input and output. Normal text aligned with dashed lines represents the internal information flow. Figure 4 shows an example workflow of this conceptual design.
  • Figure 2: The architecture of the widely adopted VED model using ViT as the encoder and GPT-2 as the decoder [kumar2022imagecaptioning].
  • Figure 3: The architecture of DETR. This model uses a convolutional neural network (CNN) backbone to learn a 2D representation of an input image. The model flattens this representation and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which are called "object queries", and additionally attends to the encoder output. The model passes each output embedding of the decoder to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class [carion2020end].
  • Figure 4: An example workflow of the ImageTalk system. The solid line delineates the sequential steps within the workflow, commencing with user input. At this initial stage, users select images and input keywords, along with language styles corresponding to the intended narrative. The selected images are processed simultaneously by an image captioning model, VED [li2022trocr], and an object detection model, DETR [carion2020end]. Context information extracted from the images is then passed to the prompt hub, together with the user-provided keywords. The updated prompt is relayed to the LLM, which in this system is GPT-3.5. The LLM generates a narrative that encapsulates the selected images and user-inputted keywords. The steering wheel symbol denotes steps that allow for direct user manipulation, with the system responding to these operations in real time. Specifically, the strike-through and the dashed boxes and arrows indicate the user's steering.
  • Figure 5: This ImageTalk user interface is used for functional capability evaluation, operated by the researcher on behalf of the participants. Before the user study begins, participants provide images to the researcher, who then uploads them into the system. During the study, the researcher selects images supplied by the participant. The system extracts context information from these images and displays it in the Object Detection and Captions areas. The participant can instruct the researcher on any desired edits to this generated information. Subsequently, the participant provides keywords associated with each selected image and specifies the desired language style. After the researcher inputs this information and clicks the Generate Story button, the system automatically creates a story. It is important to note that this version of ImageTalk is NOT the finalized system but a transitional design and experiment tool aimed at evaluating the concept and participants' acceptance of the generated context. This GUI helps us better understand the design space, and the finalized system will require further iterations of the interaction design to meet end-users' operational needs, which will be addressed in future work.
  • ...and 5 more figures
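The workflow described in the Figure 4 caption (captions and detected objects from each image, merged with user keywords and a language style in a "prompt hub", then sent to an LLM) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `ImageContext` and `build_prompt`, the prompt wording, and the way user steering is modelled (a set of detected objects the user has struck out) are all assumptions. The captioning, detection, and LLM calls are omitted, and their outputs are passed in as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ImageContext:
    """Context for one selected image (hypothetical structure)."""
    caption: str                                        # from an image captioning model
    objects: List[str] = field(default_factory=list)    # from an object detector
    keywords: List[str] = field(default_factory=list)   # typed by the user


def build_prompt(images: List[ImageContext], style: str,
                 removed: Optional[Set[str]] = None) -> str:
    """Assemble one text prompt from image context, keywords, and style.

    `removed` models user steering: detected objects the user struck out
    are left out of the prompt.
    """
    removed = removed or set()
    lines = [f"Write a short story in a {style} style."]
    for i, img in enumerate(images, start=1):
        lines.append(f"Image {i} caption: {img.caption}")
        kept = [obj for obj in img.objects if obj not in removed]
        if kept:
            lines.append(f"Image {i} objects: {', '.join(kept)}")
        if img.keywords:
            lines.append(f"Image {i} keywords: {', '.join(img.keywords)}")
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = ImageContext(
        caption="a dog on a beach",
        objects=["dog", "frisbee", "person"],
        keywords=["holiday"],
    )
    # The user strikes out "person" before the story is generated.
    print(build_prompt([ctx], style="casual", removed={"person"}))
```

Keeping the prompt assembly separate from the model calls is one way to let edits to captions, objects, or keywords take effect immediately: only the prompt needs to be rebuilt before the next generation request.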