Generating Completions for Broca's Aphasic Sentences Using Large Language Models
Sijbren van Vaals, Yevgen Matusevych, Frank Tsiwah
TL;DR
This study reconstructs non-fluent Broca's aphasic speech into complete utterances by fine-tuning four encoder-decoder LLMs on synthetic Broca's aphasic data. A rule-based generator creates Broca-like fragments from neurotypical speech, and these fragments are used to fine-tune T5 and Flan-T5 models for sentence completion. The models are evaluated with metrics on synthetic data and with qualitative assessment on authentic aphasic data from AphasiaBank. The synthetic data closely mirror authentic aphasic speech in surprisal, and longer input context improves completions; the authentic data, however, reveal limitations such as occasional reproduction or negation in the outputs. The work demonstrates the potential of LLM-based assistive tools for aphasia rehabilitation, highlights the value and challenges of synthetic data in clinical NLP, and outlines directions for future benchmarking and cross-language expansion.
Abstract
Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, approaches based on natural language processing, such as Large Language Models (LLMs), could help improve existing treatments. To explore this potential, we investigate the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using only this synthetic data, without authentic aphasic samples, we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. The models are able to reconstruct agrammatic sentences, and their performance improves with longer input utterances. Our results highlight the potential of LLMs to advance communication aids for individuals with Broca's aphasia and possibly other clinical populations.
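To make the data-generation idea concrete, the following is a minimal sketch of how a rule-based step could turn a neurotypical sentence into a Broca-like fragment paired with its original as the completion target. It is an illustration only, not the rule set used in this work: the single rule (dropping function words), the word list `FUNCTION_WORDS`, and the names `make_fragment`, `make_pair`, and `drop_prob` are all assumptions introduced here.

```python
import random

# Hypothetical, deliberately small function-word list (an assumption for
# illustration; the actual rules of the generator are not specified here).
FUNCTION_WORDS = {
    "the", "a", "an", "is", "are", "was", "were", "to", "of",
    "in", "on", "at", "and", "but", "that", "will", "has", "have",
}


def make_fragment(sentence, drop_prob=1.0, seed=None):
    """Drop each function word with probability drop_prob; keep content words."""
    rng = random.Random(seed)
    kept = [
        tok
        for tok in sentence.split()
        # Content words are always kept; function words survive only when
        # the random draw is at or above drop_prob.
        if tok.lower().strip(".,!?") not in FUNCTION_WORDS
        or rng.random() >= drop_prob
    ]
    return " ".join(kept)


def make_pair(sentence, drop_prob=1.0, seed=None):
    """Build one (input, target) example for sequence-to-sequence fine-tuning."""
    return {
        "input": make_fragment(sentence, drop_prob, seed),
        "target": sentence,
    }


pair = make_pair("The boy is walking to the school.")
print(pair["input"])   # fragment with function words removed
print(pair["target"])  # original sentence, used as the completion target
```

Pairs of this shape (agrammatic input, complete target) are the kind of data a sequence-to-sequence model such as T5 can be fine-tuned on; a realistic generator would need further rules, for example for verb inflection and utterance length.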
