Properties of Group Fairness Metrics for Rankings
Tobias Schumacher, Marlene Lutz, Sandipan Sikdar, Markus Strohmaier
TL;DR
This work addresses how to evaluate group fairness in rankings by proposing an axiomatic framework of 13 properties and applying it to 11 existing metrics. It distinguishes two ranking settings, full-population ranking and subset retrieval, and analyzes both theoretically and empirically which properties each metric satisfies. The findings show that most metrics satisfy only a small subset of the properties, with exposure-based metrics generally performing better in the subset-retrieval setting and prefix/AWRF metrics failing several universal properties. The study gives practitioners guidance on selecting metrics suited to their application context and highlights fundamental limitations in current fair-ranking metrics. Overall, the paper offers a principled lens for interpreting fairness metrics and motivates future refinement of evaluation tools for fair rankings.
Abstract
In recent years, several metrics have been developed for evaluating group fairness of rankings. Given that these metrics were developed with different application contexts and ranking algorithms in mind, it is not obvious which metric to choose for a given scenario. In this paper, we perform a comprehensive comparative analysis of existing group fairness metrics developed in the context of fair ranking. Because of their diverse application contexts, we argue that such a comparative analysis is not straightforward. Hence, we take an axiomatic approach whereby we design a set of thirteen properties for group fairness metrics that consider different ranking settings. A metric can then be selected depending on whether it satisfies all or a subset of these properties. We apply these properties to eleven existing group fairness metrics, and through both empirical and theoretical results we demonstrate that most of these metrics satisfy only a small subset of the proposed properties. These findings highlight limitations of existing metrics and provide insights into how to evaluate and interpret different fairness metrics in practical deployment. The proposed properties can also assist practitioners in selecting appropriate metrics for evaluating fairness in a specific application.
