Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies

Changye Li; Zhecheng Sheng; Trevor Cohen; Serguei Pakhomov

TL;DR

This paper investigates whether larger autoregressive language models are disproportionately resilient to dementia-like linguistic perturbations. It introduces a bidirectional attention head ablation method for GPT-2 models and evaluates it with a paired-perplexity paradigm on Cookie Theft transcripts from the ADReSS and WLS datasets. The results show that larger models need a greater share of their attention heads masked to degrade as much as smaller models, which suggests a form of artificial neural reserve localized to the attention mechanism. The approach achieves competitive dementia-discrimination performance with far fewer masked parameters than previous methods and points to potential uses in low-resource screening and in modeling the progression of neurodegenerative disease.
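The paired-perplexity idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each transcript has already been scored by an intact and a masked model, that the feature is the difference of the two log perplexities, and that a larger difference indicates dementia-like language. The function names, the threshold, and the numbers are all hypothetical.

```python
from typing import List, Sequence


def paired_score(log_ppl_intact: float, log_ppl_masked: float) -> float:
    """Log ratio of the two perplexities for one transcript."""
    return log_ppl_intact - log_ppl_masked


def classify(scores: Sequence[float], threshold: float) -> List[bool]:
    """True flags a transcript as dementia-like (assumed direction)."""
    return [s > threshold for s in scores]


def accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of transcripts whose predicted label matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up (log PPL intact, log PPL masked) pairs, not results from the paper.
pairs = [(3.0, 4.5), (3.2, 4.4), (4.1, 4.3), (4.4, 4.5)]
labels = [False, False, True, True]

scores = [paired_score(intact, masked) for intact, masked in pairs]
predictions = classify(scores, threshold=-0.5)
```

In practice the threshold would be tuned on held-out data rather than fixed by hand as it is here.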

Abstract

As artificial neural networks grow in complexity, understanding their inner workings becomes increasingly challenging, which is a particular concern in healthcare applications. Perplexity (PPL), the intrinsic evaluation metric of autoregressive neural language models (NLMs), reflects how "surprised" an NLM is by novel input, and it has been widely used to understand the behavior of NLMs. Previous findings show that changes in PPL when masking attention layers in pre-trained transformer-based NLMs reflect linguistic anomalies associated with Alzheimer's disease dementia. Building on this, we explore a novel bidirectional attention head ablation method that exhibits properties attributed to the concepts of cognitive and brain reserve in human brain studies, which postulate that people with more neurons in the brain and more efficient processing are more resilient to neurodegeneration. Our results show that larger GPT-2 models require a disproportionately larger share of their attention heads to be masked (ablated) to display degradation of similar magnitude to that of smaller models. These results suggest that the attention mechanism in transformer models may offer an analogue to the notions of cognitive and brain reserve and could potentially be used to model certain aspects of the progression of neurodegenerative disorders and aging.
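For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood a model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from some language model; the function names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the log perplexity."""
    return math.exp(log_perplexity(token_logprobs))


# A model that spreads probability evenly over 4 choices at every step
# is exactly as "surprised" as a fair 4-way guess.
uniform = [math.log(0.25)] * 5
```

Log PPL, the quantity plotted in the paper's figures, is the value returned by `log_perplexity`; lower values mean the model finds the text less surprising.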

Paper Structure (14 sections, 5 figures, 6 tables)

Introduction
Background
Cognitive Reserve
Probing the Neural Network
Linguistic Anomalies in AD
Methods
Data
Modeling and Evaluation
Results
Effects of Masking on Perplexity
Effects of Masking on Dementia Classification
Discussion
Conclusion
Appendix

Figures (5)

Figure 1: A theoretical illustration of cognitive reserve and its mediation effect between AD neuropathology (x-axis) and clinical outcome (y-axis). Illustration derived from Stern (2002, 2009). As the disease progresses (i.e., with more impairment), individuals with higher cognitive/brain reserve would be more resilient to the effects, resulting in a lower level of clinical severity.
Figure 2: The "Cookie Theft" picture description stimulus.
Figure 3: Changes in model log PPL as a function of the proportion of masked attention heads across GPT-2 models of various sizes. Note: the curves in panel (a) show that the GPT-2 XL model has the most non-linear (concave) shape, indicating that it starts to degrade rapidly only after about 50% of its attention heads have been masked, followed by the curve for the GPT-2 large model. The smaller GPT-2 models begin to degrade with proportionally less masking and exhibit a monotonic relationship between the share of attention heads masked and model performance. The curves in panel (b) show almost completely preserved model performance, with no differences between models, up to the point at which 40% to 50% of the columns in their embedding matrices have been masked. After that point, the performance of all models collapses "catastrophically".
Figure 4: Comparison of GPT-2 models with masked attention heads on paired-perplexity classification performance. The left y-axis denotes classification performance using both masked and unmasked GPT-2 models on the ADReSS test set. The right y-axis indicates log PPL estimated from transcripts of healthy WLS participants. The x-axis represents the percentage of attention heads masked. The vertical dashed line marks the best-performing masking pattern, the one that achieves the highest ACC.
Figure 5: Comparison of GPT-2 models with masked columns of the word embedding matrix on classification performance and cognitive reserve manifestation. The left y-axis denotes classification performance using both masked and unmasked GPT-2 models on the ADReSS test set. The right y-axis indicates log PPL estimated from transcripts of healthy WLS participants. The x-axis represents the percentage of word embedding matrix columns masked. The vertical dashed line marks the best-performing masking pattern, the one that achieves the highest ACC.
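
To make the masking percentages on the x-axes concrete, the sketch below converts a share of masked heads into a head count for each public GPT-2 checkpoint and picks heads from both ends of an importance ranking. The layer and head counts are those of the released GPT-2 models. Everything else is an assumption for illustration: this page does not spell out the selection rule, so taking half the masked heads from the least important end and half from the most important end is only one possible reading of "bidirectional", and the importance scores and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (layers, attention heads per layer) for the public GPT-2 checkpoints.
GPT2_SHAPES: Dict[str, Tuple[int, int]] = {
    "gpt2": (12, 12),
    "gpt2-medium": (24, 16),
    "gpt2-large": (36, 20),
    "gpt2-xl": (48, 25),
}


def heads_to_mask(n_layers: int, n_heads: int, share: float) -> int:
    """Number of heads covered by a masking share between 0 and 1."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie in [0, 1]")
    return round(n_layers * n_heads * share)


def select_bidirectional(importance: List[List[float]], share: float) -> List[Tuple[int, int]]:
    """Pick (layer, head) pairs from both ends of an importance ranking.

    ASSUMPTION: half of the masked heads come from the least important end
    and the rest from the most important end. `importance[layer][head]`
    holds hypothetical scores supplied by the caller.
    """
    ranked = sorted(
        (score, layer, head)
        for layer, row in enumerate(importance)
        for head, score in enumerate(row)
    )
    k = heads_to_mask(len(importance), len(importance[0]), share)
    low = ranked[: k // 2]
    high = ranked[len(ranked) - (k - k // 2):]
    return sorted((layer, head) for _, layer, head in low + high)


# Heads masked at a 50% share for each model size.
half = {name: heads_to_mask(layers, heads, 0.5) for name, (layers, heads) in GPT2_SHAPES.items()}
```

The same 50% share corresponds to 72 heads in the smallest model and 600 in GPT-2 XL, which is why the figures compare models by proportion rather than by absolute head count.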
