Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Nora Petrova, Andrew Gordon, Enzo Blindow

TL;DR

This paper introduces HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction, and argues that LLM evaluation needs this broader perspective.

Abstract

The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants, stratified across 22 demographic groups in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy in which google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
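To make the Bradley-Terry-Davidson idea concrete, the following is a minimal sketch of the standard Davidson tie extension of Bradley-Terry, not the paper's hierarchical Bayesian implementation. It assumes each model has a log-strength `theta` and that a single tie parameter `nu` controls how often comparisons end in a tie; the hierarchical priors and the post-stratification step are not shown. The function names `btd_probs` and `round_robin_score` are hypothetical. The scoring rule (1 point per win, 0.5 per tie) is an assumption inferred from the Figure 1 caption, where 28 models give a maximum of 27 and a mean of 13.5.

```python
import math


def btd_probs(theta_i, theta_j, nu):
    """Return (P(i wins), P(tie), P(j wins)) under the Davidson tie model.

    theta_i, theta_j: log-strengths of the two models (assumed parameterisation).
    nu: non-negative tie parameter; nu = 0 recovers plain Bradley-Terry.
    """
    p_i, p_j = math.exp(theta_i), math.exp(theta_j)
    tie_mass = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie_mass
    return p_i / total, tie_mass / total, p_j / total


def round_robin_score(thetas, nu):
    """Expected points for each model when it meets every other model once.

    Assumed scoring: 1 point for a win, 0.5 for a tie, 0 for a loss.
    """
    scores = []
    for i, theta_i in enumerate(thetas):
        points = 0.0
        for j, theta_j in enumerate(thetas):
            if i == j:
                continue  # a model does not play itself
            win, tie, _ = btd_probs(theta_i, theta_j, nu)
            points += win + 0.5 * tie
        scores.append(points)
    return scores


if __name__ == "__main__":
    # Hypothetical log-strengths for four models; not values from the paper.
    print(round_robin_score([1.2, 0.4, 0.0, -0.6], nu=0.3))
```

Under this scoring every pairing distributes exactly one point, so the scores always sum to the number of pairings. With 28 models that is 378 points in total, which is consistent with the mean of 13.5 quoted in the Figure 1 caption.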

Paper Structure

This paper contains 48 sections, 5 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Model performance on the "Overall Winner" metric. Bars represent the Score (expected points in a round-robin tournament; max=27, mean=13.5), with 95% credible intervals.
  • Figure 2: Demographic preference heterogeneity, shown by: (Left) inter-group disagreement (avg. rank difference), and (Right) user decisiveness (tie rates by age).
  • Figure 3: Heatmap showing model rankings across five evaluation dimensions. Lower ranks (darker green) indicate better performance. Models show significant variation in their relative strengths, with some excelling in reasoning while others lead in communication or trust.
  • Figure 4: Discriminative power of evaluation dimensions measured by tie rates. Trust, Ethics & Safety shows the highest ambiguity (65% ties), while Overall Winner is most decisive (10% ties).
  • Figure 5: Decomposition of tie rates for Age × Politics (US). Left: Observed tie rates. Middle: Expected tie rates under an additive model (grand mean + row effect + column effect). Right: Interaction effects in percentage points (observed - expected).
  • ...and 6 more figures
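
The decomposition described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: it assumes an unweighted additive model in which every cell of the age-by-politics table counts equally, whereas the paper may weight cells by sample size. The function name `additive_decomposition` and the example numbers are hypothetical.

```python
def additive_decomposition(observed):
    """Split a two-way table of tie rates into additive and interaction parts.

    observed: list of rows, each a list of tie rates in percent
              (rows = one demographic factor, columns = the other).
    Returns (expected, interaction), where
      expected[i][j]    = grand mean + row effect i + column effect j
      interaction[i][j] = observed[i][j] - expected[i][j]  (percentage points)
    """
    n_rows, n_cols = len(observed), len(observed[0])
    grand = sum(sum(row) for row in observed) / (n_rows * n_cols)

    # Row and column effects are deviations of the marginal means from the grand mean.
    row_eff = [sum(row) / n_cols - grand for row in observed]
    col_eff = [
        sum(observed[i][j] for i in range(n_rows)) / n_rows - grand
        for j in range(n_cols)
    ]

    expected = [
        [grand + row_eff[i] + col_eff[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    interaction = [
        [observed[i][j] - expected[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return expected, interaction


if __name__ == "__main__":
    # Hypothetical tie rates (%), two age bands by two political groups.
    table = [[10.0, 20.0], [30.0, 60.0]]
    expected, interaction = additive_decomposition(table)
    print(expected)
    print(interaction)
```

When the two factors combine purely additively, every interaction cell is zero; non-zero cells mark combinations whose tie rate is higher or lower than the two marginal effects alone would predict.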