Joint Evaluation of Fairness and Relevance in Recommender Systems with Pareto Frontier

Theresia Veronika Rampisela, Tuukka Ruotsalo, Maria Maistro, Christina Lioma

TL;DR

This work introduces DPFR (Distance to Pareto Frontier), a method for jointly evaluating relevance and fairness in recommender systems. It constructs a Pareto frontier from test-data scores for a chosen pair of relevance and fairness measures, then measures each model's distance to a reference point on that frontier selected by α, which sets the trade-off between fairness and relevance. Across six real-world datasets and multiple model and re-ranking configurations, DPFR shows that existing relevance-only (Rel) and joint fairness-relevance (Fair+Rel) metrics rank models inconsistently and can lead to misleading conclusions, whereas DPFR gives a robust, interpretable joint assessment and can be computed efficiently via frontier estimation. The findings suggest using α-weighted measure pairs such as MAP-Ent or NDCG-Ent for more reliable joint evaluation, and point to open work on refining fair-exposure measures and on making frontier estimation more scalable for offline RS evaluation.

Abstract

Fairness and relevance are two important aspects of recommender systems (RSs). Typically, they are evaluated either (i) separately by individual measures of fairness and relevance, or (ii) jointly using a single measure that accounts for fairness with respect to relevance. However, approach (i) often does not provide a reliable joint estimate of the goodness of the models, as it has two different best models: one for fairness and another for relevance. Approach (ii) is also problematic because these measures tend to be ad-hoc and do not relate well to traditional relevance measures, like NDCG. Motivated by this, we present a new approach for jointly evaluating fairness and relevance in RSs: Distance to Pareto Frontier (DPFR). Given some user-item interaction data, we compute their Pareto frontier for a pair of existing relevance and fairness measures, and then use the distance from the frontier as a measure of the jointly achievable fairness and relevance. Our approach is modular and intuitive as it can be computed with existing measures. Experiments with 4 RS models, 3 re-ranking strategies, and 6 datasets show that existing metrics have inconsistent associations with our Pareto-optimal solution, making DPFR a more robust and theoretically well-founded joint measure for assessing fairness and relevance. Our code: https://github.com/theresiavr/DPFR-recsys-evaluation
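The idea of measuring distance to a point on the Pareto frontier can be illustrated with a minimal Python sketch. This is not the authors' implementation (see their repository for that). It assumes that candidate (relevance, fairness) points are already available, that higher is better on both axes, that α selects the point a fraction α along the frontier's piecewise-linear path starting from the relevance-optimal end, and that the distance is Euclidean. The function names `pareto_frontier`, `reference_point`, and `dpfr`, and all scores below, are hypothetical.

```python
import math

def pareto_frontier(points):
    """Return the non-dominated (rel, fair) points; higher is better on both axes."""
    # Sort by relevance descending (ties: fairness descending), then sweep and
    # keep a point only if its fairness beats every point seen so far.
    frontier, best_fair = [], -math.inf
    for rel, fair in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if fair > best_fair:
            frontier.append((rel, fair))
            best_fair = fair
    return frontier  # ordered from most relevant to most fair

def reference_point(frontier, alpha):
    """Point a fraction `alpha` along the frontier polyline (assumed selection rule)."""
    if len(frontier) == 1:
        return frontier[0]
    segments = list(zip(frontier, frontier[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    target = alpha * sum(lengths)
    for (a, b), length in zip(segments, lengths):
        if target <= length:
            t = target / length
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        target -= length
    return frontier[-1]  # guards against floating-point overshoot at alpha = 1

def dpfr(model_scores, frontier, alpha=0.5):
    """Euclidean distance from each model's (rel, fair) score to the reference point."""
    ref = reference_point(frontier, alpha)
    return {name: math.dist(score, ref) for name, score in model_scores.items()}

# Hypothetical candidate points and model scores, for illustration only.
candidates = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0), (0.4, 0.4), (0.5, 0.2)]
models = {"A": (0.1, 0.85), "B": (0.5, 0.2), "C": (0.4, 0.45)}

frontier = pareto_frontier(candidates)
distances = dpfr(models, frontier, alpha=0.5)
best_by_avg = max(models, key=lambda m: sum(models[m]) / 2)
best_by_dpfr = min(distances, key=distances.get)
print(best_by_avg, best_by_dpfr)
```

With these made-up scores, averaging relevance and fairness favors model A, while the distance to the frontier's midpoint favors model C, which is the kind of disagreement the abstract describes. How the frontier itself is derived from the user-item interaction data is outside the scope of this sketch.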

Paper Structure

This paper contains 30 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: $(x, y)$ denotes the pair of relevance and fairness score. Example: Model A is best for fairness, Model B is best for relevance, and Model C is the closest to the Pareto Frontier (PF) midpoint, when relevance and fairness are equally weighted ($\alpha=0.5$). Averaging relevance and fairness (Avg) leads to falsely concluding that Model A is best for both aspects. Note that distance to PF also beats other existing measures of fairness and relevance (see $\S$\ref{ss:corr}).
  • Figure 2: Pareto Frontier of fairness and relevance (in blue) and recommender scores for Lastfm and QK-video on exponential-like scales. The labels Rel, Fair, Avg (mean of Rel and Fair), and DPFR mark the best model under each evaluation approach.
  • Figure 3: Kendall's $\tau$ correlation heatmap between the rank ordering of existing joint evaluation measures (including the average of Fair and Rel scores, avg), and DPFR.
  • Figure 4: Pareto Frontier of fairness and relevance (in blue), together with recommender model scores for Amazon-lb, Jester, and ML-*. Fair measures are on the $y$-axis and Rel measures are on the $x$-axis. We use exponential-like scales to make the model points easier to distinguish. The labels Rel, Fair, Avg, and DPFR mark the best model under each evaluation approach.
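
The rank-agreement comparison described in the Figure 3 caption can be illustrated with a small Python sketch of Kendall's $\tau$. This computes the tau-a variant (no tie correction), which is an assumption; the paper may use a different variant. The function name `kendall_tau` and the scores are hypothetical. Because a smaller distance to the frontier is better, the distances are negated so that both lists are "higher is better".

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models (needs >= 2 models)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A pair is concordant when both measures order models i and j the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for three models, for illustration only.
avg_scores = [0.475, 0.35, 0.425]       # mean of relevance and fairness
neg_distances = [-0.53, -0.30, -0.11]   # negated distance to the frontier reference point

tau = kendall_tau(avg_scores, neg_distances)
print(round(tau, 3))
```

A value near 1 means the two measures rank the models almost identically, a value near -1 means they rank them in nearly opposite order, and a value near 0 means little agreement.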