The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models
Nathan Roll, Jill Kries, Flora Jin, Catherine Wang, Ann Marie Finley, Meghan Sumner, Cory Shain, Laura Gwilliams
TL;DR
The paper presents the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery to assess aphasia-like deficits in language models. It reframes aphasia evaluation as a computational behavioral benchmark with four subtests (Connected Text, Word Comprehension, Sentence Comprehension, Repetition) and an automated APROCSA-based evaluation that uses in-context prompting with Gemini 2.5 Flash. The automated protocol achieves reliability comparable to expert raters when prevalence is accounted for (prevalence-weighted Cohen's $\kappa = 0.255$ for model-consensus agreement vs. $0.286$ for human-human agreement). The TAB enables scalable, text-based analysis of linguistic breakdown in LLMs, bridging clinical aphasiology and computational linguistics while acknowledging limitations such as modality constraints, a limited number of items, and the need for language and cultural adaptation. This work provides a principled, open framework for probing language structure and representational integrity in artificial systems, with potential to guide model refinement and theoretical work on language processing. The TAB thus serves as a practical tool for large-scale study of aphasia-like patterns in AI and a stepping stone toward more nuanced cross-disciplinary theories of language in machines.
Abstract
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasia-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model–consensus agreement vs. 0.286 for human–human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
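To make the agreement statistic concrete, the following is a minimal sketch of plain (unweighted) Cohen's kappa for two raters. It is an illustration only: the prevalence weighting reported above is not specified in this section and is not implemented here, and the function name `cohens_kappa` and the binary present/absent ratings are hypothetical, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items.

    Assumption: this is the textbook statistic, not the prevalence-weighted
    variant reported in the abstract.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings: 1 = feature judged present, 0 = absent.
model_ratings = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
consensus_ratings = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(model_ratings, consensus_ratings)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on 8 of 10 items, but because "absent" dominates both sets of ratings, chance agreement is already 0.58, so kappa is well below the raw agreement rate. This sensitivity to label prevalence is the usual motivation for prevalence-aware variants of the statistic.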
