Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

Naba Rizvi, Harper Strickland, Saleha Ahmedi, Aekta Kallepalli, Isha Khirwadkar, William Wu, Imani N. S. Munyaka, Nedjma Ousidhoum

TL;DR

The paper evaluates how well four LLMs detect nuanced anti-autistic ableism in text, comparing their judgments with autistic perspectives using the AUTALIC dataset and the SATA, AQ, and IAT measures, complemented by in-context learning and persona prompts. It finds that LLMs rely largely on surface keywords and struggle to interpret context, speaker identity, and potential impact, often reproducing biased human perspectives rather than autistic viewpoints. A binary labeling scheme reduces noise and yields better inter-model agreement, while in-context prompts and personas do little to align models with autistic perspectives. The work underscores the need for neurodiversity-inclusive NLP design and collaboration with autistic communities to make detection of ableist content in real-world applications more accurate and ethical.

Abstract

Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.

Paper Structure

This paper contains 26 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Two example sentences as labeled by (from top to bottom) Mistral 7B, DeepSeek 7B, Gemma-2 9B, Llama-3 8B, and our human annotators, illustrating LLM difficulties with context.
  • Figure 2: The distribution of explicit autism acceptance (SATA) scores and likelihood of being autistic (AQ scores) among humans and LLMs in our study.
  • Figure 3: Z-scores of each LLM’s sensitivity to ableism, confidence, and agreement with human annotators across $284$ sentences, showing that LLMs replicate biased perspectives more effectively than community perspectives.
  • Figure 4: Comparison between binary and ternary classification schemes shows reduced noise under binary classification.