Can We Trust Recommender System Fairness Evaluation? The Role of Fairness and Relevance

Theresia Veronika Rampisela; Tuukka Ruotsalo; Maria Maistro; Christina Lioma

TL;DR

The paper asks whether joint measures of fairness and relevance (Fair+Rel) for recommender systems can be trusted, and answers with a large-scale empirical study across four real-world datasets and four recommenders. It surveys all joint Fair+Rel measures published up to Oct 2023, analyzes how well they agree with relevance-only and fairness-only measures, and assesses their sensitivity to changes in rank position and to simultaneous improvements in relevance and fairness. Most joint measures either align with relevance, align with fairness, or behave inconsistently, and many have extremely small score ranges and are insensitive to changes in the ranking. Based on these results, the authors offer practical guidelines: avoid overreliance on joint measures, prefer measuring relevance and fairness separately, and select representative measures from distinct clusters while accounting for computational cost. The work highlights how difficult it is to design a single, balanced joint measure and motivates future work on more robust, informative measures or on joint optimization approaches that reflect both objectives.

Abstract

Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance/fairness measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? We empirically study for the first time the behaviour of these measures across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank position changes than relevance- and fairness-only measures, meaning that they are less granular than traditional RS measures; and iii) tend to compress scores at the low end of their range, meaning that they are not very expressive. We counter the above limitations with a set of guidelines on the appropriate usage of such measures, i.e., they should be used with caution because they tend to contradict each other and have a very small empirical range.
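The agreement questions in the abstract come down to rank correlation between measures. As a minimal sketch, assuming each measure yields one score per recommender run, Kendall's $\tau$ can be computed as below. The function name, the tau-a variant (no tie correction), and all scores are illustrative assumptions, not values or code from the paper.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of four recommender runs under two measures.
relevance_scores = [0.31, 0.27, 0.12, 0.05]
joint_scores = [0.0040, 0.0038, 0.0041, 0.0012]

tau = kendall_tau_a(relevance_scores, joint_scores)
print(f"Kendall's tau-a: {tau:.3f}")
```

When one measure is lower-is-better and the other higher-is-better, one of the two score lists would need to be negated first so that a positive $\tau$ means agreement.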

Paper Structure (20 sections, 24 equations, 3 figures, 3 tables)

Introduction
Individual item fairness & relevance
Notation and definitions
Joint measures of fairness and relevance
Inequity of Amortized Attention (IAA) [Biega2018EquityRankings]
Individual Fairness Disparity (IFD) [Singh2019PolicyRanking, Oosterhuis2021ComputationallyFairness]
Hellinger Distance (HD) [Jeunen2021Top-KExposure]
Mean Max Envy (MME) [Saito2022FairRanking]
Item Better-Off (IBO) & Item Worse-Off (IWO) [Saito2022FairRanking]
Individual-user-to-individual-item fairness (II-F) [Wu2022JointRecommendation]
All-users-to-individual-item fairness (AI-F) [Wu2022JointRecommendation]
Experimental setup
Empirical analysis
Evaluation results of all measures
Correlation between measures (RQ1 & RQ2)
...and 5 more sections

Figures (3)

Figure 1: Kendall's $\tau$ correlation between joint Fair+Rel measures, Rel, and Fair measures.
Figure 2: Sliding window evaluation ($\mathbf{k=5}$) of NCL for Lastfm, Amazon-lb, and ML-10M. The last column is in exponential scale.
Figure 3: Artificial insertion of items with $\mathbf{m=1000}$ (users).
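The Figure 2 caption does not define the sliding-window procedure. A minimal sketch, assuming it means scoring each block of $k$ consecutive rank positions as the block moves down the ranking, could look like the following. The helper names, the toy ranking, and the use of precision as the scored measure are illustrative assumptions, not the paper's setup.

```python
def sliding_windows(ranking, k=5):
    """Yield (start_rank, window) for every block of k consecutive items.

    start_rank is 1-based, so the first window covers ranks 1..k.
    """
    for start in range(len(ranking) - k + 1):
        yield start + 1, ranking[start:start + k]


def precision(window, relevant):
    """Fraction of items in the window that are relevant."""
    return sum(item in relevant for item in window) / len(window)


# Hypothetical ranked list for one user and that user's relevant items.
ranking = ["a", "b", "c", "d", "e", "f", "g"]
relevant = {"a", "c", "g"}

scores = [(start, precision(window, relevant))
          for start, window in sliding_windows(ranking, k=5)]
for start, score in scores:
    print(f"ranks {start}-{start + 4}: precision = {score:.1f}")
```

Plotting such per-window scores for several measures side by side would show which of them react as the window moves to lower ranks and which stay nearly flat.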
