Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study

Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma

TL;DR

This paper provides a critical, unified examination of eight exposure-based, individual-item fairness measures in recommender systems, identifying fundamental theoretical limitations such as non-realisability and edge-case undefinedness. It introduces principled corrections and normalization to enable reliable absolute fairness interpretation, while also offering guidance on when to use original versus corrected measures. Through extensive experiments on six real and synthetic datasets with seven recommender models, the authors show that corrected measures preserve ranking while improving interpretability, and reveal nuanced interactions between fairness and relevance. The work delivers practical guidelines and publicly available code to help researchers and practitioners select appropriate measures and apply corrections, advancing the reliable evaluation of item-level fairness in RSs.

Abstract

Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim to quantify the disparity in how individual items are recommended to users, separate from item relevance to users. We gather all such measures and we critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and different levels of measure sensitivity and strictness. We conclude with practical suggestions of which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems.

Paper Structure

This paper contains 55 sections, 2 theorems, 36 equations, 18 figures, and 11 tables.

Key Result

Theorem 1

If $A$ contains only one pair $(i,i')$, then $\text{VoCD}_{\max}$ is obtained when $\sum\limits_{u\in U}1_{R_{u}^{k}}(i)=1$ and $\sum\limits_{u\in U}1_{R_{u}^{k}}(i')=m$.

Figures (18)

  • Figure 1: Correlation (Kendall's $\tau$) between relevance and fairness measures for Lastfm. Asterisk ($^*$) denotes a statistically significant correlation ($\alpha=0.05$), after applying the Benjamini-Hochberg procedure.
  • Figure 2: Correlation (Kendall's $\tau$) between relevance and fairness measures for Ml-1m. Asterisk ($^*$) denotes a statistically significant correlation ($\alpha=0.05$), after applying the Benjamini-Hochberg procedure.
  • Figure 3: Most fair scores with varying $k$ for higher-is-fairer fairness measures for Lastfm and Ml-1m. All scores from the corrected measures (denoted by 'our') overlap with each other.
  • Figure 5: Most unfair scores with varying $k$ for higher-is-fairer fairness measures for Lastfm and Ml-1m. On Repeatable MostUnfair, all scores from the corrected measures (denoted by 'our') overlap with each other for the shown values of $k>1$ for Lastfm and for all shown values of $k$ for Ml-1m.
  • Figure 7: Sliding window evaluation for the BPR model on Lastfm and Ml-1m. Each row of figures corresponds to one dataset, and each column to one group of measures (relevance, higher-is-better fairness, and lower-is-better fairness measures). The II-D and AI-D lines overlap.
  • ...and 13 more figures
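
The captions of Figures 1 and 2 mark correlations as significant at $\alpha=0.05$ after the Benjamini-Hochberg procedure. As a rough illustration of that multiple-testing step only, the sketch below applies the standard step-up rule to a list of p-values. The function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the paper or its released code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up rule (illustrative sketch).

    Returns a list of booleans, aligned with p_values, that is True
    where the corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank r with p_(r) <= (r / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one per (relevance, fairness) measure pair.
example_p_values = [0.001, 0.012, 0.025, 0.045, 0.200]
significant = benjamini_hochberg(example_p_values)
```

Because the rule is step-up, a p-value above its own rank threshold is still rejected when a larger p-value further down the sorted list meets its threshold.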

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2