The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models
Hannah Chen, Yangfeng Ji, David Evans
TL;DR
The paper tackles the problem of measuring allocational harms in large language models when predictions drive resource-constrained decisions. It introduces Rank-Allocational-Based Bias Index (RABBI), a model-agnostic, rank-based metric that compares candidate scores to quantify allocation disparities, with scoring implemented via pointwise or pairwise LLM ranking. Across two allocation tasks (resume screening and essay grading) and ten LLMs, RABBI exhibits strong correlation with actual allocation gaps, outperforming traditional metrics like average prediction gaps and distribution-based divergences. The findings demonstrate that auditing models using outcome-agnostic bias measures can misrepresent harms, whereas RABBI provides a task-aligned assessment that supports better model selection to minimize allocational harms in limited-resource settings. The work has practical implications for AI governance and auditing, urging deployment-aware evaluation that directly links model outputs to decision outcomes.
Abstract
Large language models (LLMs) are now being considered and even deployed for applications that support high-stakes decision-making, such as recruitment and clinical decisions. While several methods have been proposed for measuring bias, there remains a gap between predictions, which are what the proposed methods consider, and how they are used to make decisions. In this work, we introduce Rank-Allocational-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in LLM predictions. We compare RABBI and current bias metrics on two allocation decision tasks. We evaluate their predictive validity across ten LLMs and utility for model selection. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes, whereas RABBI exhibits a strong correlation with allocation disparities. Our work highlights the need to account for how models are used in contexts with limited resource constraints.
