The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models

Hannah Chen; Yangfeng Ji; David Evans

TL;DR

The paper addresses how to measure allocational harms when large language model predictions drive decisions under resource constraints. It introduces the Rank-Allocational-Based Bias Index (RABBI), a model-agnostic, rank-based metric that compares candidate scores to quantify allocation disparities, with scores obtained through pointwise or pairwise LLM ranking. Across two allocation tasks (resume screening and essay grading) and ten LLMs, RABBI correlates strongly with actual allocation gaps and outperforms existing metrics based on average performance gaps and distribution distances. The findings show that bias measures that ignore how predictions are turned into decisions can misrepresent harms, whereas RABBI gives a task-aligned assessment that supports selecting models to minimize allocational harms when resources are limited. The work has practical implications for AI governance and auditing: evaluation should be deployment-aware and should link model outputs directly to decision outcomes.
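To make the rank-based idea concrete, here is a minimal Python sketch. This summary does not state RABBI's formula, so the pairwise form below (the share of cross-group pairs in which group A's candidate outscores group B's, minus the reverse share) is an assumption for illustration only. The function name `rank_based_index` and the sample scores are hypothetical.

```python
from itertools import product


def rank_based_index(scores_a, scores_b):
    """Hypothetical rank-based bias index in [-1, 1].

    Assumed form (not taken from the source): compare every candidate in
    group A with every candidate in group B and return
    (wins for A - wins for B) / number of pairs. Ties count for neither side.
    """
    if not scores_a or not scores_b:
        raise ValueError("both groups need at least one score")
    wins_a = wins_b = 0
    for a, b in product(scores_a, scores_b):
        if a > b:
            wins_a += 1
        elif a < b:
            wins_b += 1
    return (wins_a - wins_b) / (len(scores_a) * len(scores_b))


# Illustrative scores only; not data from the paper.
group_a = [0.9, 0.4, 0.4]
group_b = [0.6, 0.6, 0.5]
print(rank_based_index(group_a, group_b))
```

In this toy data both groups have roughly the same mean score, yet the index is negative because most of group A's candidates rank below group B's. A positive value would indicate the opposite ordering, and zero would indicate no rank advantage for either group.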

Abstract

Large language models (LLMs) are now being considered and even deployed for applications that support high-stakes decision-making, such as recruitment and clinical decisions. While several methods have been proposed for measuring bias, there remains a gap between predictions, which are what the proposed methods consider, and how they are used to make decisions. In this work, we introduce the Rank-Allocational-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in LLM predictions. We compare RABBI and current bias metrics on two allocation decision tasks. We evaluate their predictive validity across ten LLMs and their utility for model selection. Our results reveal that commonly used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes, whereas RABBI exhibits a strong correlation with allocation disparities. Our work highlights the need to account for how models are used in contexts where resources are limited.
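The gap between prediction-level metrics and allocation outcomes can be seen in a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the tie-breaking rule (input order), and the toy scores are all hypothetical. The selection-rate gap follows the page's description of the demographic parity gap as a selection rate difference over selection rounds with quota $k$.

```python
from statistics import fmean


def average_score_gap(pool, group_a, group_b):
    """Mean score of group_a minus mean score of group_b."""
    scores_a = [s for g, s in pool if g == group_a]
    scores_b = [s for g, s in pool if g == group_b]
    return fmean(scores_a) - fmean(scores_b)


def selection_rate_gap(rounds, group_a, group_b, k):
    """Selection-rate difference between two groups under a top-k quota.

    rounds: list of candidate pools; each pool is a list of (group, score).
    In every round the k highest-scoring candidates are selected.
    Ties are broken by input order (an assumption of this sketch).
    """
    selected = {group_a: 0, group_b: 0}
    total = {group_a: 0, group_b: 0}
    for pool in rounds:
        # Rank the pool by score, highest first; sorted() is stable.
        ranked = sorted(pool, key=lambda c: c[1], reverse=True)
        for position, (group, _) in enumerate(ranked):
            if group in total:
                total[group] += 1
                if position < k:
                    selected[group] += 1
    if total[group_a] == 0 or total[group_b] == 0:
        raise ValueError("each group needs at least one candidate")
    return selected[group_a] / total[group_a] - selected[group_b] / total[group_b]


# Toy pool (not data from the paper): both groups have about the same mean.
pool = [("A", 0.9), ("A", 0.4), ("A", 0.4),
        ("B", 0.6), ("B", 0.6), ("B", 0.5)]

print("average score gap:", average_score_gap(pool, "A", "B"))
for k in (1, 2, 3):
    print("k =", k, "selection-rate gap:", selection_rate_gap([pool], "A", "B", k))
```

With these toy scores the average score gap is approximately zero, while the selection-rate gap favors group A at $k=1$, vanishes at $k=2$, and favors group B at $k=3$. An averaged, outcome-agnostic metric cannot reflect that dependence on the quota.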

Paper Structure (21 sections, 11 equations, 13 figures, 2 tables)

Introduction
Background: Bias in NLP
Measuring Bias
Allocational Harms
Measuring Allocational Bias
Proposed Bias Measure
Rank Scoring using LLMs
Evaluating Bias Metrics
Evaluation Tasks
Measuring Allocation Gaps
Bias Metric Baselines
Experimental Setup
Results
Predictive Validity Test
Metric Utility for Model Selection
...and 6 more sections

Figures (13)

Figure 1: Bias scores per group for the resume screening task. Each score is computed with respect to White Male. $\delta$ indicates the average performance gap, measured as the average score difference. The demographic parity gap ($\Delta\text{DP}$) represents the selection rate difference over multiple candidate selection rounds, with selection quota $k=2$ for each round.
Figure 2: Measurement comparison between bias metrics and allocation gaps, with quota $k=1$. Each point represents a score measured between group $\mathcal{A} \in \mathcal{G} \setminus \mathcal{B}$ and reference group $\mathcal{B}$ given a model $\mathcal{M}$ for a job position or an essay topic. RABBI shows higher correlations with $\Delta\text{DP}$ and $\Delta\text{EO}$.
Figure 3: Measurement comparison between bias metrics and allocation gaps for pairwise evaluation, with quota $k=1$. Models with $>55\%$ of inconsistent outputs are excluded.
Figure 4: Average NDCG@$N$ in ranking model fairness, compared with the ideal rankings. EMD yields the same results as the average score gap. Due to inconsistencies in pairwise outputs, we exclude two models from resume screening and all pairwise results for essay grading (7 models with $>50\%$ inconsistencies).
Figure 5: Fairness ranking of models for each resume screening job position with pointwise evaluation and selection quota $k=2$. The true rank order is based on $\Delta\text{DP}$. Existing bias metrics often rank more biased models as more "fair". RABBI produces rankings closer to the true rank order.
...and 8 more figures
