CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System
Yashar Deldjoo, Tommaso di Noia
TL;DR
This paper argues that prior fairness evaluations of LLM-based recommender systems (RecLLMs) conflate personalization with bias: they do not check whether a change in recommendations reflects a user's true preferences or a stereotype. It introduces CFaiRLLM, an enhanced evaluation framework that combines true preference alignment with intersectional fairness and adds diverse user-profile sampling strategies to cope with LLM token limits. The framework defines two rankers (neutral and sensitive-attribute-influenced) and two notions of benefit (item similarity and true preference alignment), proposes metrics including JS@K, PRAG, SNSR, and SNSV, and is validated on MovieLens and LastFM. Key findings: true preference alignment yields lower unfairness than pure similarity measures; intersectional attributes amplify fairness gaps, especially in music; and the more sophisticated sampling strategies (top-rated and recency-based) can mitigate bias while improving personalization, which supports more equitable RecLLMs in practice.
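To make the similarity-based metrics named above concrete, here is a minimal Python sketch. It assumes JS@K is the Jaccard similarity between the top-K items of the neutral list and a sensitive-attribute list, that SNSR is the range (max minus min) of those per-group similarities, and that SNSV is their population standard deviation. These definitions are assumptions, since this section does not spell them out. PRAG is omitted for the same reason. The function names and the toy lists are hypothetical.

```python
from statistics import pstdev


def jaccard_at_k(neutral, sensitive, k):
    """Jaccard similarity between the top-k items of two ranked lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    union = a | b
    # Two empty lists are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0


def snsr(similarities):
    """Assumed definition: range (max - min) of per-group similarity scores."""
    values = list(similarities.values())
    return max(values) - min(values)


def snsv(similarities):
    """Assumed definition: population standard deviation of per-group scores."""
    return pstdev(list(similarities.values()))


# Toy data: one neutral list and one list per sensitive-attribute value.
neutral = ["A", "B", "C", "D"]
per_group = {
    "group_1": ["A", "B", "C", "E"],
    "group_2": ["A", "F", "G", "H"],
}
sims = {g: jaccard_at_k(neutral, recs, k=4) for g, recs in per_group.items()}
print(sims, snsr(sims), snsv(sims))
```

Under these assumed definitions, a larger SNSR or SNSV means the recommendations shift more unevenly across groups when the sensitive attribute is revealed.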
Abstract
This work takes a critical stance on previous studies concerning fairness evaluation in Large Language Model (LLM)-based recommender systems, which have primarily assessed consumer fairness by comparing recommendation lists generated with and without sensitive user attributes. Such approaches implicitly treat discrepancies in recommended items as biases, overlooking whether these changes might stem from genuine personalization aligned with the true preferences of users. Moreover, these earlier studies typically address single sensitive attributes in isolation, neglecting the complex interplay of intersectional identities. In response to these shortcomings, we introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness by considering overlapping sensitive attributes. Additionally, CFaiRLLM introduces diverse user profile sampling strategies (random, top-rated, and recency-focused) to better understand how the construction of the profile fed to the LLM affects fairness, given the inherent token limitations of these systems. Because fairness depends on accurately understanding users' tastes and preferences, these strategies provide a more realistic assessment of fairness within RecLLMs. To validate the efficacy of CFaiRLLM, we conducted extensive experiments on the MovieLens and LastFM datasets, applying various sampling strategies and sensitive attribute configurations. The evaluation metrics include both item similarity measures and true preference alignment, covering both hits and ranking (Jaccard Similarity and PRAG), thereby enabling a multifaceted analysis of recommendation fairness.
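The three profile sampling strategies named in the abstract can be illustrated with a short Python sketch. It assumes each interaction carries an item name, a rating, and a timestamp, and that a profile is truncated to n items to fit the LLM's token budget. The `Interaction` type, the `sample_profile` function, the tie-breaking rule for equal ratings, and the toy history are all hypothetical and are not taken from the paper.

```python
import random
from typing import NamedTuple


class Interaction(NamedTuple):
    item: str
    rating: float
    timestamp: int


def sample_profile(history, n, strategy="random", seed=0):
    """Pick at most n interactions from a user's history to place in the prompt."""
    if strategy == "random":
        rng = random.Random(seed)  # seeded so the sketch is reproducible
        return rng.sample(history, min(n, len(history)))
    if strategy == "top_rated":
        # Highest ratings first; ties broken by recency (an assumption).
        ranked = sorted(history, key=lambda x: (x.rating, x.timestamp), reverse=True)
        return ranked[:n]
    if strategy == "recent":
        # Most recent interactions first.
        return sorted(history, key=lambda x: x.timestamp, reverse=True)[:n]
    raise ValueError(f"unknown strategy: {strategy}")


history = [
    Interaction("Heat", 5.0, 100),
    Interaction("Alien", 3.0, 400),
    Interaction("Fargo", 4.5, 300),
    Interaction("Se7en", 5.0, 200),
]
top = [x.item for x in sample_profile(history, 2, "top_rated")]
recent = [x.item for x in sample_profile(history, 2, "recent")]
rand = sample_profile(history, 2, "random")
print(top, recent, [x.item for x in rand])
```

Each strategy hands the LLM a different view of the same user, which is why the choice of sampling can change both the personalization quality and the measured fairness.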
