Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers

Yuan Wang; Xuyang Wu; Hsin-Tai Wu; Zhiqiang Tao; Yi Fang

TL;DR

This paper investigates whether LLMs, when used to rank documents, produce biased outcomes with respect to binary protected attributes such as gender and geography. It develops a dual evaluation framework (listwise and pairwise) and constructs neutral and sensitive prompts to quantify user-side and item-side fairness. Group exposure is formalized as $Exposure(G|P)$, fairness is measured by the ratio $Exposure(G_1|P)/Exposure(G_0|P)$, and $P@20$ serves as the utility metric. Evaluating GPT-3.5, GPT-4, Mistral-7b, and Llama2-13b on the TREC Fair Ranking data, the study finds that neural rankers often achieve higher precision than LLMs, while LLMs exhibit fairness patterns that vary across attributes. LoRA fine-tuning of Mistral-7b improves fairness, yielding exposure ratios closer to 1.0. The work provides the first fairness benchmark for LLMs as rankers and demonstrates a practical, parameter-efficient mitigation path, informing the design of fairer search and ranking systems in real-world deployments.
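To make the listwise metrics concrete, here is a minimal Python sketch of a group exposure ratio and $P@20$. The summary does not reproduce the paper's exact formula for $Exposure(G|P)$, so the logarithmic position discount and the per-group averaging used below are assumptions borrowed from common practice in fair-ranking work, not the authors' confirmed definition. The function names and the 0/1 group encoding are likewise hypothetical.

```python
import math
from typing import Sequence


def group_exposure(groups: Sequence[int], group: int) -> float:
    """Average position-discounted exposure of one group in a ranked list.

    groups[i] is the binary protected attribute (0 or 1) of the item at
    rank i + 1. ASSUMPTION: exposure at rank r is 1 / log2(1 + r); the
    paper's own discount may differ.
    """
    weights = [
        1.0 / math.log2(1 + rank)
        for rank, g in enumerate(groups, start=1)
        if g == group
    ]
    # A group with no items in the list receives zero exposure.
    return sum(weights) / len(weights) if weights else 0.0


def exposure_ratio(groups: Sequence[int]) -> float:
    """Exposure(G_1|P) / Exposure(G_0|P); 1.0 means equal average exposure."""
    e0 = group_exposure(groups, 0)
    e1 = group_exposure(groups, 1)
    if e0 == 0.0:
        raise ValueError("group 0 has no exposure; the ratio is undefined")
    return e1 / e0


def precision_at_k(relevance: Sequence[int], k: int = 20) -> float:
    """Fraction of the top-k ranked items that are relevant (1) vs. not (0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


if __name__ == "__main__":
    # Hypothetical ranked list: group labels and relevance labels by rank.
    ranked_groups = [0, 1, 0, 0, 1, 1]
    ranked_relevance = [1, 1, 0, 1, 0, 0]
    print("exposure ratio:", exposure_ratio(ranked_groups))
    print("P@4:", precision_at_k(ranked_relevance, k=4))
```

Under this sketch, a ratio below 1.0 indicates that group 1 items sit lower in the ranking on average than group 0 items, and a ratio above 1.0 indicates the reverse.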

Abstract

The integration of Large Language Models (LLMs) into information retrieval has prompted a critical reevaluation of fairness in text-ranking models. LLMs such as the GPT models and Llama2 have proven effective in natural language understanding tasks, and prior work (e.g., RankGPT) has shown that LLMs can outperform traditional ranking models on ranking tasks. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs on the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which have historically been underrepresented in search outcomes. Our analysis examines how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both the user and content perspectives, contributing an empirical benchmark for evaluating LLMs as fair rankers.

Paper Structure (22 sections, 2 equations, 4 figures, 3 tables)


Introduction
Related Works
Ranking with LLMs
Fairness in LLMs
Fairness in Search and Ranking
LLM Fair Ranking
Datasets
Listwise Evaluation
Data Construction
Metrics
Pairwise Evaluation
Data Construction
Metrics
Results and Analysis
Listwise Evaluation Results
...and 7 more sections

Figures (4)

Figure 1: Illustration of two evaluation methods: (a) Listwise evaluation and (b) Pairwise evaluation. Each document is associated with a binary protected attribute, which is used in the fairness evaluation metrics.
Figure 2: Proposed Evaluation Framework: This schematic diagram represents our dual evaluation methodology. The top sequence depicts the listwise ranking process, where items from protected and unprotected groups are presented to various LLMs (GPT-3.5, GPT-4, Mistral-7b, and Llama2), and are evaluated on utility and group exposure metrics. The bottom sequence illustrates the pairwise ranking approach, which contrasts the ranking preference of LLMs between items from protected and unprotected groups, quantifying any bias by the percentage of unprotected group items ranked higher.
Figure 3: Distribution of predicted rankings for the protected groups on the TREC datasets under the listwise evaluation. The plots reveal ranking variability and potential biases with respect to gender and geographic attributes, highlighting areas where the LLMs' fairness can improve.
Figure 4: Impact of LoRA Fine-Tuning on Mistral-7b's Fairness. Panel (a) shows the percentage of first-ranked items from the protected and unprotected groups, while panel (b) shows the resulting fairness ratios. The LoRA-adjusted model yields ratios closer to the ideal fairness benchmark of 1.0 across TREC datasets.
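
The pairwise quantities described in the captions of Figures 2 and 4 (the share of first-ranked items from each group and the resulting fairness ratio) can be sketched in Python as follows. This is an illustrative sketch only: the function name, the encoding of protected items as 1 and unprotected items as 0, and the choice to put the protected share in the numerator are assumptions, since the captions do not specify them.

```python
from collections import Counter
from typing import Sequence, Tuple


def pairwise_preference(first_ranked: Sequence[int]) -> Tuple[float, float, float]:
    """Summarize which group an LLM ranks first across many document pairs.

    first_ranked[i] is the group label of the item ranked first in pair i.
    ASSUMPTION: 1 = protected group, 0 = unprotected group.

    Returns (protected_share, unprotected_share, ratio), where ratio is
    protected_share / unprotected_share and 1.0 means no preference.
    """
    total = len(first_ranked)
    if total == 0:
        raise ValueError("at least one pairwise comparison is required")

    counts = Counter(first_ranked)
    protected_share = counts[1] / total
    unprotected_share = counts[0] / total
    if unprotected_share == 0.0:
        raise ValueError("unprotected group never ranked first; ratio undefined")

    return protected_share, unprotected_share, protected_share / unprotected_share


if __name__ == "__main__":
    # Hypothetical outcomes of eight pairwise comparisons.
    outcomes = [1, 0, 0, 1, 0, 0, 1, 0]
    protected, unprotected, ratio = pairwise_preference(outcomes)
    print(f"protected first: {protected:.1%}")
    print(f"unprotected first: {unprotected:.1%}")
    print(f"fairness ratio: {ratio:.3f}")
```

Reading the result against the captions, a ratio near 1.0 corresponds to the ideal fairness benchmark mentioned for Figure 4, and a ratio well below 1.0 means the unprotected group is ranked first more often.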
