A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System

Yashar Deldjoo, Fatemeh Nazary

TL;DR

This paper tackles fairness in RecLLMs by introducing a normative framework that distinguishes two notions: fairness with respect to the use of sensitive attributes (the neutral ranker $\mathcal{R}_{N}$ vs. the sensitive ranker $\mathcal{R}_{S}^{a}$) and fairness relative to target distributions, captured by Intrinsic Fairness (IF). It formalizes the evaluation around three metrics: Neutral vs. Sensitive Ranker Deviation (NSD), Neutral vs. Counterfactual Sensitive Deviation (NCSD), and IF, all grounded in benefit measures $\mathcal{B}$ and deviations $\Delta \mathcal{B}$, and applies them under in-context learning variants (0-shot, ICL-1, ICL-2) to MovieLens data. Empirical results show that gender fairness remains largely stable, while age-based fairness can be significantly affected by contextual information, especially under ICL-2 conditions, highlighting the risk of bias amplification with richer prompts. The work provides a principled, transparent auditing framework for RecLLMs, enabling practitioners to select appropriate reference rankers, define benefit types, and assess statistical significance to guide mitigation efforts.

Abstract

The rapid adoption of large language models (LLMs) in recommender systems (RS) presents new challenges in understanding and evaluating their biases, which can result in unfairness or the amplification of stereotypes. Traditional fairness evaluations in RS primarily focus on collaborative filtering (CF) settings, which may not fully capture the complexities of LLMs, as these models often inherit biases from large, unregulated data. This paper proposes a normative framework to benchmark consumer fairness in LLM-powered recommender systems (RecLLMs). We critically examine how fairness norms in classical RS fall short in addressing the challenges posed by LLMs. We argue that this gap can lead to arbitrary conclusions about fairness, and we propose a more structured, formal approach to evaluating fairness in such systems. Our consumer-fairness experiments on the MovieLens dataset, using in-context learning (zero-shot vs. few-shot), reveal fairness deviations in age-based recommendations, particularly when additional contextual examples are introduced (ICL-2). Statistical significance tests confirm that these deviations are not random, highlighting the need for robust evaluation methods. While this work offers only a preliminary discussion of the proposed normative framework, we hope it can provide a formal, principled approach for auditing and mitigating bias in RecLLMs. The code and dataset used for this work will be shared at "gihub-anonymized".
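
To make the idea of comparing a neutral ranker against a sensitive-attribute ranker concrete, the following is a minimal sketch only. This summary does not give the paper's exact definitions of NSD, NCSD, or IF, so everything here is an assumption: the benefit is taken to be a hit rate over the top-k items, the deviation is the mean absolute per-user benefit gap, and significance is checked with a paired sign-flip permutation test. The function names and toy data are hypothetical.

```python
import random
from statistics import mean


def hit_rate(recommended, relevant, k=10):
    """Assumed benefit measure: share of the top-k items the user liked."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(item in relevant for item in top_k) / len(top_k)


def benefit_deviation(neutral_benefits, sensitive_benefits):
    """Assumed deviation: mean absolute per-user benefit gap between
    the neutral prompt and the sensitive-attribute prompt."""
    return mean(abs(n - s) for n, s in zip(neutral_benefits, sensitive_benefits))


def paired_permutation_pvalue(neutral_benefits, sensitive_benefits,
                              n_resamples=2000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    rng = random.Random(seed)
    diffs = [s - n for n, s in zip(neutral_benefits, sensitive_benefits)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical toy data: liked items and two recommendation lists per user.
relevant = {"u1": {"a", "b"}, "u2": {"c"}}
neutral = {"u1": ["a", "b", "x", "y"], "u2": ["c", "x", "y", "z"]}
sensitive = {"u1": ["a", "x", "y", "z"], "u2": ["c", "x", "y", "z"]}

users = sorted(relevant)
b_neutral = [hit_rate(neutral[u], relevant[u], k=4) for u in users]
b_sensitive = [hit_rate(sensitive[u], relevant[u], k=4) for u in users]

print("deviation:", benefit_deviation(b_neutral, b_sensitive))
print("p-value:", paired_permutation_pvalue(b_neutral, b_sensitive))
```

In an actual audit, the two recommendation lists would come from prompting the same LLM with and without the sensitive attribute, and the benefit measure and test would be replaced by whichever ones the framework prescribes.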

Paper Structure

This paper contains 8 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: This figure illustrates the direct use of Large Language Models (LLMs) in generating personalized recommendations. It compares outputs under neutral conditions with those generated under scenarios that consider sensitive attributes.