Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models

Abhishek Kumar; Sarfaroz Yunusov; Ali Emami

TL;DR

This work addresses subtle biases in LLM outputs by introducing the Representative Bias Score (RBS) and the Affinity Bias Score (ABS), and by using the Creativity-Oriented Generation Suite (CoGS) to quantify biases in both generation and evaluation. Evaluating GPT-4, LLaMA-2, and Mixtral on 3,240 CoGS problem instances, the authors find pronounced representative biases toward white, straight, and male identities, along with model-specific affinity-bias fingerprints; human evaluators show related patterns. The methodology combines semantic-similarity analysis of identity-modulated outputs with evaluator preferences to produce cross-model bias profiles, which allows scalable benchmarking of subtle biases in creative generation and evaluation. The findings bear on fairness in AI-assisted storytelling and evaluation, and point toward bias-awareness tools and the inclusion of additional identity axes in future studies.

Abstract

Research on Large Language Models (LLMs) has often neglected subtle biases that, although less apparent, can significantly influence the models' outputs toward particular social narratives. This study addresses two such biases within LLMs: representative bias, which denotes a tendency of LLMs to generate outputs that mirror the experiences of certain identity groups, and affinity bias, which reflects the models' evaluative preferences for specific narratives or viewpoints. We introduce two novel metrics to measure these biases, the Representative Bias Score (RBS) and the Affinity Bias Score (ABS), and present the Creativity-Oriented Generation Suite (CoGS), a collection of open-ended tasks such as short story writing and poetry composition, designed with customized rubrics to detect these subtle biases. Our analysis uncovers marked representative biases in prominent LLMs, with a preference for identities associated with being white, straight, and male. Furthermore, our investigation of affinity bias reveals distinctive evaluative patterns within each model, akin to "bias fingerprints". This trend is also seen in human evaluators, highlighting a complex interplay between human and machine bias perceptions.
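The idea of an evaluator's preferences forming a "fingerprint" can be illustrated with a simple tally. The sketch below is not the paper's ABS formula, which this page does not state; it only computes the share of times an evaluator picked the output written for each identity prompt. The function name `preference_shares` and the identity labels are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, Iterable


def preference_shares(selections: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of evaluator picks that went to each identity.

    `selections` holds one identity label per comparison, naming the
    identity prompt whose output the evaluator preferred.
    """
    counts = Counter(selections)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one selection is required")
    return {identity: n / total for identity, n in counts.items()}


# Toy data (made up for illustration): four comparisons, one pick each.
picks = ["identity_a", "identity_b", "identity_a", "identity_c"]
shares = preference_shares(picks)
```

An evaluator with no preference would produce roughly equal shares across identities; a strongly uneven distribution is the kind of pattern the abstract describes as an evaluative preference.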

Paper Structure (17 sections, 10 equations, 20 figures, 6 tables)

Introduction
Creativity-Oriented Generation Suite
Measuring Subtle Bias in LLMs
Representative Bias
Affinity Bias
Experiments & Results
Experimental Design
Results
Which Identities do LLMs Default To?
Do LLMs Show Preference for Certain Identities?
Qualitative Analysis
Related Work
Conclusion
Appendix
Affinity Biases: GPT-4 as an evaluator
...and 2 more sections

Figures (20)

Figure 1: Proportion of GPT-4's preferred responses for the short poem task in CoGS, categorized by identity-specific prompts, with highlighted sectors indicating a preference for outputs from those identities.
Figure 2: Short Poem task ($t$) in CoGS with identity prompt ($i$), theme ($c$), and evaluated using rubric ($t_r$). This illustrates how tasks integrate themes and identities into creative outputs, assessed by predefined criteria.
Figure 3: Illustration of calculating semantic similarity for representative bias (left) and selecting the best outputs for affinity bias (right). Semantic similarity is measured by comparing vector embeddings of outputs from default ($O_d$) and identity-specific ($O_i, i \in \{\text{races}\}$) prompts. The right side shows the evaluator LLM's selection of preferred outputs from $O_i$ across themes, represented as a pie chart of overall preferences.
Figure 4: Bar charts illustrating the semantic similarity for contents generated by each LLM across identity axes, in contrast to default responses.
Figure 5: Radar plots displaying affinity biases for three LLM evaluators: GPT-4, LLaMA-2, and Mixtral.
...and 15 more figures
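
The comparison described in the Figure 3 caption, between embeddings of default outputs ($O_d$) and identity-specific outputs ($O_i$), can be sketched with cosine similarity. This is a minimal illustration under stated assumptions: the page does not say which embedding model or similarity function the authors use, so cosine similarity, the helper names, and the two-dimensional toy vectors are all hypothetical stand-ins.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-length vector has no direction")
    return dot / (norm_u * norm_v)


def similarity_to_default(
    default_vec: Sequence[float],
    identity_vecs: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Compare the default output embedding with each identity embedding."""
    return {
        identity: cosine_similarity(default_vec, vec)
        for identity, vec in identity_vecs.items()
    }


# Toy embeddings (made up): real ones would come from an embedding model.
O_d = [1.0, 0.0]
O_i = {"identity_a": [1.0, 0.0], "identity_b": [0.0, 1.0]}
scores = similarity_to_default(O_d, O_i)
```

In this toy case the default output is identical in direction to `identity_a` and orthogonal to `identity_b`. An identity whose outputs sit consistently closer to the default output is the one the model appears to default to.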
