Can We Trust Recommender System Fairness Evaluation? The Role of Fairness and Relevance
Theresia Veronika Rampisela, Tuukka Ruotsalo, Maria Maistro, Christina Lioma
TL;DR
The paper asks whether joint fairness-and-relevance measures for recommender systems can be trusted, and answers with a large-scale empirical study across four real-world datasets and four recommenders. It surveys all joint Fair+Rel measures published up to Oct 2023, analyzes their agreement with relevance-only and fairness-only metrics, and assesses their sensitivity to rank changes and to simultaneous improvements in relevance and fairness. Key findings show that most joint measures align either with relevance or with fairness, or behave inconsistently, and that many have extremely small score ranges and are insensitive to changes in rank position. Based on these results, the authors offer practical guidelines: avoid overreliance on joint measures, prefer measuring relevance and fairness separately, and select representative measures from distinct clusters while noting computational trade-offs. The work highlights the fundamental difficulty of designing a single, balanced joint metric and motivates future work on more robust, informative measures or joint optimization approaches that truly reflect both objectives.
Abstract
Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent of relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance/fairness measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? For the first time, we empirically study the behaviour of these measures across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank position changes than relevance- and fairness-only measures, meaning that they are less granular than traditional RS measures; and iii) tend to compress scores at the low end of their range, meaning that they are not very expressive. We counter the above limitations with a set of guidelines on the appropriate usage of such measures, i.e., they should be used with caution because they tend to contradict each other and have a very small empirical range.
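To make the notion of agreement between measures concrete, the following is a minimal sketch of one way it could be quantified: a rank correlation over the scores that two measures assign to the same set of recommenders. Kendall's tau is used here as one common choice; the abstract does not say which statistic the authors used. The function name `kendall_tau` and all scores below are hypothetical and chosen only for illustration, not taken from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Counts concordant and discordant pairs; tied pairs count as neither.
    No tie correction is applied (an assumption made for simplicity).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four recommenders (illustrative only).
relevance = [0.30, 0.25, 0.20, 0.10]     # a relevance-only measure
joint_a = [0.012, 0.011, 0.009, 0.004]   # joint measure ordering models like relevance
joint_b = [0.004, 0.009, 0.011, 0.012]   # joint measure with the opposite ordering

tau_a = kendall_tau(relevance, joint_a)  # full agreement with relevance
tau_b = kendall_tau(relevance, joint_b)  # full disagreement with relevance
```

In this toy setup the two joint measures rank the recommenders in opposite orders, which is the kind of contradiction the abstract warns about. Both also occupy a very narrow score range, so small absolute differences carry the whole ranking.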
