ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
Mahika Phutane, Hayoung Jung, Matthew Kim, Tanushree Mitra, Aditya Vashistha
TL;DR
This study audits six LLMs across 2,820 hiring conversations, with a focus on the Global South, and finds pervasive ableist and intersectional biases toward disabled candidates. It introduces ABLEIST, a set of eight metrics (five ableism-specific and three intersectional) grounded in disability studies and designed to detect covert harms. The audit shows substantial harms toward disabled candidates, which are amplified further when gender and caste intersect with disability. Notably, standard toxicity detectors fail to flag these harms, prompting the authors to develop a reusable open-weight ABLEIST detector by distilling a strong teacher model. The work underscores the need for intersectional safety evaluations in high-stakes AI applications such as hiring, and provides both methodological tools and empirical evidence to push safety research beyond single-axis analyses. Collectively, the findings argue for integrating ABLEIST assessments into AI governance to prevent deepening socio-economic disparities in the Global South and beyond.
Abstract
Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization, such as gender and caste, shape the experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates, harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender- and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.
