CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System

Yashar Deldjoo; Tommaso di Noia

TL;DR

This paper tackles fairness evaluation for LLM-based recommender systems (RecLLMs), arguing that prior work conflated personalization with bias and overlooked whether deviations reflect true user preferences or stereotypes. It introduces CFaiRLLM, an enhanced framework that combines true preference alignment with intersectional fairness and adds diverse user-profile sampling strategies to address LLM token limits. The approach defines two rankers (neutral and sensitive-attribute-influenced) and two benefits (item similarity and true preference alignment), and proposes metrics including JS@K, PRAG, SNSR, and SNSV, validated on MovieLens and LastFM. Key findings show that true preference alignment yields lower unfairness than pure similarity measures, that intersectional attributes amplify fairness gaps (especially in music), and that sophisticated sampling strategies (top-rated and recency-based) can mitigate biases while improving personalization, enabling more equitable RecLLMs in practice.
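To make the item-similarity benefit concrete, here is a minimal sketch of comparing a neutral recommendation list with lists produced under sensitive-attribute prompts. It assumes JS@K denotes Jaccard similarity over the top-K items; the function names, the group labels, and the max-minus-min gap across groups are illustrative assumptions, not the paper's exact definitions.

```python
from typing import Dict, List, Tuple


def jaccard_at_k(neutral: List[str], sensitive: List[str], k: int) -> float:
    """Jaccard similarity between the top-k items of two ranked lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    union = a | b
    # Convention (assumption): two empty lists are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def similarity_gap(
    neutral: List[str], sensitive_by_group: Dict[str, List[str]], k: int
) -> Tuple[Dict[str, float], float]:
    """Per-group similarity to the neutral list, plus the max-min spread.

    A larger spread means some groups' recommendations drift further from
    the neutral list than others (hypothetical unfairness indicator).
    """
    scores = {
        group: jaccard_at_k(neutral, recs, k)
        for group, recs in sensitive_by_group.items()
    }
    return scores, max(scores.values()) - min(scores.values())


if __name__ == "__main__":
    neutral = ["A", "B", "C", "D"]
    by_group = {
        "group_1": ["A", "B", "C", "E"],  # shares 3 of 4 items with neutral
        "group_2": ["A", "X", "Y", "Z"],  # shares 1 of 4 items with neutral
    }
    scores, gap = similarity_gap(neutral, by_group, k=4)
    print(scores, gap)
```

A similarity gap alone cannot tell bias from personalization, which is why the framework pairs it with a measure of alignment to the user's true preferences.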

Abstract

This work takes a critical stance on previous studies concerning fairness evaluation in Large Language Model (LLM)-based recommender systems, which have primarily assessed consumer fairness by comparing recommendation lists generated with and without sensitive user attributes. Such approaches implicitly treat discrepancies in recommended items as biases, overlooking whether these changes might stem from genuine personalization aligned with the true preferences of users. Moreover, these earlier studies typically address single sensitive attributes in isolation, neglecting the complex interplay of intersectional identities. In response to these shortcomings, we introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness by considering overlapping sensitive attributes. Additionally, CFaiRLLM introduces diverse user profile sampling strategies (random, top-rated, and recency-focused) to better understand the impact of profile generation fed to LLMs in light of inherent token limitations in these systems. Given that fairness depends on accurately understanding users' tastes and preferences, these strategies provide a more realistic assessment of fairness within RecLLMs. To validate the efficacy of CFaiRLLM, we conducted extensive experiments using MovieLens and LastFM datasets, applying various sampling strategies and sensitive attribute configurations. The evaluation metrics include both item similarity measures and true preference alignment considering both hit and ranking (Jaccard Similarity and PRAG), thereby conducting a multifaceted analysis of recommendation fairness.
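The three profile sampling strategies named above can be sketched as follows. This is an illustrative sketch only: the `Interaction` record, the `sample_profile` function, the strategy labels, and the tie-breaking rule are assumptions made for the example, not the authors' implementation.

```python
import random
from typing import List, NamedTuple


class Interaction(NamedTuple):
    item: str
    rating: float
    timestamp: int  # larger value = more recent


def sample_profile(
    history: List[Interaction], n: int, strategy: str = "random", seed: int = 0
) -> List[Interaction]:
    """Pick at most n interactions to place in the LLM prompt."""
    if strategy == "random":
        # Seeded generator so the sampled profile is reproducible.
        rng = random.Random(seed)
        return rng.sample(history, min(n, len(history)))
    if strategy == "top_rated":
        # Highest ratings first; ties broken by recency (assumption).
        ranked = sorted(history, key=lambda x: (x.rating, x.timestamp), reverse=True)
        return ranked[:n]
    if strategy == "recent":
        # Most recent interactions first.
        return sorted(history, key=lambda x: x.timestamp, reverse=True)[:n]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    history = [
        Interaction("a", 5, 1),
        Interaction("b", 3, 4),
        Interaction("c", 4, 3),
        Interaction("d", 5, 2),
    ]
    for name in ("random", "top_rated", "recent"):
        print(name, [x.item for x in sample_profile(history, 2, name)])
```

Because the prompt can hold only a limited number of items, which interactions are kept determines what the LLM learns about the user's tastes, and therefore how the resulting recommendations score on fairness.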

Paper Structure (33 sections, 4 figures, 6 tables)

Introduction
Contributions.
Related work
Fairness in Recommender Systems
Core RS models and Stakeholder.
Stakeholder Considerations.
Leveraging Pre-trained LMs and Prompting for Recommender Systems
Proposed Evaluation Framework
CFaiRLLM: Consumer Fairness Evaluation RecLLM
Fairness Definition
Limitations and Our Contributions.
Definition of Rankers and Benefits
Independent vs. Intersectional Fairness
Evaluation Method
Data Format for User Instructions
...and 18 more sections

Figures (4)

Figure 1: In the left figure, we showcase CFaiRLLM's fairness evaluation in movie recommendations, comparing recommendation similarity across sensitive (gender, age) and intersectional attributes to a neutral standard, emphasizing user preferences. Our aim is equity, ensuring that sensitive attribute recommendations align with neutral benchmarks. The right details the sensitive attributes explored.
Figure 2: Fairness and Accuracy Metrics Across Sampling Strategies on ML-1M dataset.
Figure 3: Fairness and Accuracy Metrics Across Models and Datasets.
Figure 4: Heatmap Comparison of Recommendation Fairness and Similarity: The left heatmap shows the effect of increasing movie counts with random sampling, the middle heatmap depicts the outcome of using a top-rated sampling strategy, and the right heatmap presents fairness scores, highlighting the differential impact of sampling strategies on recommendation quality.
