Adapting Large Language Models for Character-based Augmentative and Alternative Communication

Dylan Gaines, Keith Vertanen

TL;DR

The paper addresses the problem of generating accurate next-character probabilities for AAC interfaces from subword-tokenized LLMs. It introduces a beam-search-style algorithm that removes trailing tokens to allow flexible tokenization and yields a distribution over characters from subword models. Using private AAC-like test sets, large-scale in-domain and out-of-domain data, and a DeBERTaV3 sentence classifier, the study shows that domain-adapted subword LLMs (notably opt-350m) achieve the lowest per-character perplexities and meaningful keystroke savings compared with n-gram, byte-level, and classification-layer baselines. The work demonstrates the practical viability of adapting LLMs for letter-by-letter AAC input, while acknowledging trade-offs in inference latency and the need for user-specific data and evaluation to maximize real-world benefits.
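For readers unfamiliar with the metric, per-character perplexity is conventionally the exponential of the average negative log-probability a model assigns to each character that was actually written. The sketch below shows that standard definition only; the function name and the example probabilities are illustrative assumptions, and the paper's exact evaluation setup (for example, which symbols are counted) may differ.

```python
import math


def per_char_perplexity(char_probs):
    """Standard per-character perplexity.

    char_probs: the probability the model assigned to each character
    that was actually typed, in order (hypothetical input format).
    """
    if not char_probs:
        raise ValueError("need at least one character probability")
    # Average negative log-likelihood per character (natural log).
    avg_nll = -sum(math.log(p) for p in char_probs) / len(char_probs)
    # Perplexity is independent of the log base as long as exp matches it.
    return math.exp(avg_nll)


# Illustrative values only, not results from the paper.
example = per_char_perplexity([0.5, 0.25, 0.5, 0.25])
```

Lower values mean the model concentrates more probability on the characters the user goes on to type, which is what a letter-by-letter interface needs.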

Abstract

Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. Our algorithm for producing character predictions from a subword large language model (LLM) provides more accurate predictions than using a classification layer, a byte-level LLM, or an n-gram model. Additionally, we investigate a domain adaptation procedure based on a large dataset of sentences we curated based on scoring how useful each sentence might be for spoken or written AAC communication. We find our procedure further improves model performance on simple, conversational text.
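To make the core idea concrete, the following is a minimal sketch of how next-character probabilities can be obtained from a subword model by summing over token sequences consistent with the text typed so far. It is not the authors' algorithm: it searches exhaustively instead of using a pruned beam, it re-tokenizes a short partial word from scratch rather than removing trailing tokens from a longer context, and `toy_next_token_probs` with its vocabulary is a hypothetical stand-in for a real LLM call.

```python
from collections import defaultdict


def toy_next_token_probs(tokens):
    """Hypothetical stand-in for an LLM's next-token distribution."""
    if not tokens:
        # Word-initial tokens (made-up probabilities).
        return {"h": 0.2, "he": 0.3, "hello": 0.2, "help": 0.1, "hi": 0.2}
    # Continuation tokens, independent of context in this toy model.
    return {"l": 0.5, "y": 0.3, "e": 0.2}


def next_char_distribution(typed, next_token_probs, max_depth=4):
    """Distribution over the character that follows `typed`."""
    char_mass = defaultdict(float)
    # Each hypothesis: (tokens so far, text they spell, probability).
    frontier = [((), "", 1.0)]
    for _ in range(max_depth):
        new_frontier = []
        for tokens, text, prob in frontier:
            for tok, p in next_token_probs(tokens).items():
                cand = text + tok
                if len(cand) > len(typed):
                    # Token runs past the typed text: credit the next character.
                    if cand.startswith(typed):
                        char_mass[cand[len(typed)]] += prob * p
                elif typed.startswith(cand):
                    # Still inside the typed text: keep extending this path.
                    new_frontier.append((tokens + (tok,), cand, prob * p))
        frontier = new_frontier
        if not frontier:
            break
    total = sum(char_mass.values())
    # Renormalize over hypotheses consistent with what was typed.
    return {c: m / total for c, m in char_mass.items()} if total else {}


dist = next_char_distribution("he", toy_next_token_probs)
```

In this toy run, both the single token "he" and the pair "h" + "e" spell the typed text, so both tokenizations contribute mass to the following character. Committing to a single tokenization of the context would discard part of that probability, which is the motivation for allowing flexible tokenization.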

Paper Structure

This paper contains 32 sections, 1 equation, 1 figure, 11 tables, 1 algorithm.

Figures (1)

  • Figure 1: Worker instructions for our conversational text collection task (weather communication situation).