Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study
Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
TL;DR
This paper offers a critical, unified examination of eight exposure-based measures of individual item fairness in recommender systems. It identifies fundamental theoretical limitations, such as non-realisability and undefined values in edge cases, and introduces principled corrections and normalization so that scores can be interpreted on an absolute scale, with guidance on when to use the original versus the corrected measures. In extensive experiments on six real and synthetic datasets with seven recommender models, the authors show that the corrected measures preserve ranking while improving interpretability, and they reveal nuanced interactions between fairness and relevance. The work provides practical guidelines and publicly available code to help researchers and practitioners select appropriate measures and apply the corrections.
Abstract
Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items, which aim to quantify the disparity in how individual items are recommended to users, separately from item relevance to users. We gather all such measures and critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and into different levels of measure sensitivity and strictness. We conclude with practical suggestions on which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems.
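To make the notion of an exposure-based measure of individual item fairness concrete, the following is a minimal sketch and not the authors' implementation. It assumes that exposure is simply the number of top-k lists an item appears in, and it uses the Gini index purely as a familiar inequality measure; the text above does not state which measures are studied. The function names `item_exposure` and `gini` are hypothetical.

```python
from collections import Counter


def item_exposure(recommendations, catalogue):
    """Count how often each catalogue item appears across all users' lists.

    Assumption: exposure is a plain appearance count, with no rank discount.
    """
    counts = Counter(item for rec_list in recommendations for item in rec_list)
    return [counts.get(item, 0) for item in catalogue]


def gini(exposures):
    """Gini index of an exposure distribution (0 means perfectly equal).

    Returns None when the index is undefined (no items or no exposure at all).
    """
    n = len(exposures)
    total = sum(exposures)
    if n == 0 or total == 0:
        return None  # division by zero: the measure has no value here
    ordered = sorted(exposures)
    # Standard rank-weighted form, with ranks i = 1..n over ascending values.
    weighted = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(ordered))
    return weighted / (n * total)


# Toy example: three users, top-2 lists, four items in the catalogue.
recs = [["a", "b"], ["a", "b"], ["a", "c"]]
catalogue = ["a", "b", "c", "d"]
exposures = item_exposure(recs, catalogue)
print(exposures, gini(exposures))
```

In this form the index equals 0 when all items receive the same exposure and (n - 1)/n when a single item receives all of it, so the most unequal case in a four-item catalogue scores 0.75 rather than 1. The `None` branch shows the kind of edge case in which a measure has no defined value.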
