Joint Evaluation of Fairness and Relevance in Recommender Systems with Pareto Frontier
Theresia Veronika Rampisela, Tuukka Ruotsalo, Maria Maistro, Christina Lioma
TL;DR
This work introduces DPFR (Distance to Pareto Frontier), a method for jointly evaluating relevance and fairness in recommender systems. For a chosen pair of relevance and fairness measures, it constructs a Pareto frontier from the test data and then measures a model's distance to a reference point on that frontier, selected by a weight α that sets the trade-off between fairness and relevance. Across six real-world datasets and multiple model and re-ranking configurations, traditional Rel and Fair+Rel metrics yield inconsistent rankings and can lead to misleading conclusions, whereas DPFR provides a robust, interpretable joint assessment and can be computed efficiently via frontier estimation. The findings suggest adopting α-weighted pairs such as MAP-Ent or NDCG-Ent for more reliable joint evaluation, and they point to open work on refining fair-exposure measures and on making frontier estimation more scalable for offline RS evaluation.
Abstract
Fairness and relevance are two important aspects of recommender systems (RSs). Typically, they are evaluated either (i) separately by individual measures of fairness and relevance, or (ii) jointly using a single measure that accounts for fairness with respect to relevance. However, approach (i) often does not provide a reliable joint estimate of the goodness of the models, as it has two different best models: one for fairness and another for relevance. Approach (ii) is also problematic because these measures tend to be ad-hoc and do not relate well to traditional relevance measures, like NDCG. Motivated by this, we present a new approach for jointly evaluating fairness and relevance in RSs: Distance to Pareto Frontier (DPFR). Given some user-item interaction data, we compute their Pareto frontier for a pair of existing relevance and fairness measures, and then use the distance from the frontier as a measure of the jointly achievable fairness and relevance. Our approach is modular and intuitive as it can be computed with existing measures. Experiments with 4 RS models, 3 re-ranking strategies, and 6 datasets show that existing metrics have inconsistent associations with our Pareto-optimal solution, making DPFR a more robust and theoretically well-founded joint measure for assessing fairness and relevance. Our code: https://github.com/theresiavr/DPFR-recsys-evaluation
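The idea of measuring distance to a Pareto frontier can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that), and it rests on several assumptions not stated in the abstract: both measures are higher-is-better, the candidate (fairness, relevance) points are already given, α picks the reference point by its position along the sorted frontier, and the distance is Euclidean. The function names `pareto_frontier`, `reference_point`, and `distance_to_reference` are hypothetical.

```python
from math import hypot


def pareto_frontier(points):
    """Return the non-dominated (fairness, relevance) points, sorted by fairness.

    Assumption: both coordinates are higher-is-better.
    """
    # Visit points from highest to lowest fairness; a point is kept only if
    # its relevance beats every point that is at least as fair.
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    frontier = []
    best_rel = float("-inf")
    for fair, rel in ordered:
        if rel > best_rel:
            frontier.append((fair, rel))
            best_rel = rel
    return sorted(frontier)


def reference_point(frontier, alpha):
    """Pick a frontier point by position (hypothetical selection rule).

    Assumption: alpha = 0 selects the most relevant (least fair) frontier
    point and alpha = 1 the fairest one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    index = int(alpha * (len(frontier) - 1) + 0.5)  # round half up
    return frontier[index]


def distance_to_reference(model_scores, frontier, alpha):
    """Euclidean distance from a model's (fairness, relevance) scores to the
    alpha-selected reference point; smaller is better."""
    ref_fair, ref_rel = reference_point(frontier, alpha)
    fair, rel = model_scores
    return hypot(ref_fair - fair, ref_rel - rel)


# Toy candidate points as (fairness, relevance); the values are made up.
candidates = [(0.2, 0.9), (0.5, 0.7), (0.8, 0.4), (0.4, 0.6), (0.8, 0.3)]
frontier = pareto_frontier(candidates)
score = distance_to_reference((0.4, 0.6), frontier, alpha=0.5)
```

In this toy setup a model is scored by how far it sits from the chosen frontier point, so a single number reflects both measures at once, and changing `alpha` shifts the target toward relevance or toward fairness.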
