Properties of Group Fairness Metrics for Rankings

Tobias Schumacher, Marlene Lutz, Sandipan Sikdar, Markus Strohmaier

TL;DR

This work addresses how to evaluate group fairness in rankings by proposing an axiomatic framework of 13 properties and applying it to 11 existing metrics. It distinguishes two common settings (ranking the full population and retrieving a ranked subset), then analyzes, both theoretically and empirically, which properties each metric satisfies. The findings show that most metrics satisfy only a subset of the properties, with exposure-based metrics generally performing better in the subset-ranking setting and prefix/AWRF metrics failing several universal properties. The study gives practitioners guidance on selecting a metric for a given application context and highlights fundamental limitations of current fair-ranking metrics. Overall, the paper offers a principled lens for interpreting fairness metrics and motivates further refinement of evaluation tools for fair rankings.

Abstract

In recent years, several metrics have been developed for evaluating group fairness of rankings. Given that these metrics were developed with different application contexts and ranking algorithms in mind, it is not straightforward which metric to choose for a given scenario. In this paper, we perform a comprehensive comparative analysis of existing group fairness metrics developed in the context of fair ranking. By virtue of their diverse application contexts, we argue that such a comparative analysis is not straightforward. Hence, we take an axiomatic approach whereby we design a set of thirteen properties for group fairness metrics that consider different ranking settings. A metric can then be selected depending on whether it satisfies all or a subset of these properties. We apply these properties on eleven existing group fairness metrics, and through both empirical and theoretical results we demonstrate that most of these metrics only satisfy a small subset of the proposed properties. These findings highlight limitations of existing metrics, and provide insights into how to evaluate and interpret different fairness metrics in practical deployment. The proposed properties can also assist practitioners in selecting appropriate metrics for evaluating fairness in a specific application.

Paper Structure (18 sections, 24 theorems, 19 equations, 7 figures)

This paper contains 18 sections, 24 theorems, 19 equations, 7 figures.

Key Result

Theorem 1

None of the prefix metrics satisfy property 12 (deepness threshold) and property 13 (sensitivity).

Figures (7)

  • Figure 1: Universal properties for group fairness metrics. We illustrate six universal properties using exemplary ranking scenarios in panels (a)-(f). In each scenario, a ranking candidate belongs to either a protected or a non-protected group. In (a), property 1 requires that a fairness metric $m$ reflects whether the protected group or the non-protected group is disadvantaged. The example shows that normalized discounted difference (rND) does not satisfy this criterion, as it assigns values lower than the optimal value $v_{\operatorname{opt}}$ when either group is disadvantaged. In (b), property 2 requires that $m$ is bounded. Our example shows a set of rankings for which exposure ratio (ER) goes to infinity as the ranking length increases. In (c), property 3 is satisfied if swapping a non-protected candidate with a lower-ranked protected candidate of at least equal relevance increases the score of $m$. In the given example, however, rND decreases for such a swap. In (d), property 4 requires that swapping candidates at higher ranks impacts the value of $m$ more than swapping candidates at lower ranks. In the given example, however, pairwise statistical parity (PSP) weighs these swaps equally. In (e), a metric $m$ satisfies property 5 if swapping candidates in concordance with their relevance scores increases the score if both are in the protected group, and decreases the score if both are in the non-protected group. In the given example, however, disparate treatment ratio (DTR) is not affected at all by such a swap. In (f), property 6 requires that $m$ is invariant to linear transformations of relevance scores. Here, the rankings $r$ and $r'$ are the same, except that the relevance scores $y'$ in $r'$ are a min-max scaled version of the scores $y$ in $r$. Because disparate treatment difference (DTD) assigns different scores to $r$ and $r'$, it does not satisfy property 6.
  • Figure 2: Properties for ranking the full population. We illustrate the four properties for Setting 1, in which the full population is ranked, using exemplary ranking scenarios in panels (a)-(d). In each scenario, a ranking candidate belongs to either a protected or a non-protected group. In (a), property 7 requires that when sampling uniformly over all rankings of a candidate population $\mathcal{D}$, the ranking metric $m$ should, in expectation, yield the optimal fairness score $v_{\operatorname{opt}}(m)$. We provide a minimal example in which exposure ratio (ER) does not obtain its optimal value $v_{\operatorname{opt}}(ER) = 1$ in expectation. In (b), property 8 requires that a fairness metric $m$ is invariant to ranking length. The example shows that ER fails to satisfy this criterion, as it assigns different values to rankings that differ only in length. In (c), property 9 requires that $m$ is invariant to group proportions. While the rankings in our example are intuitively similar (the only difference being the size of the protected group), exposure difference (ED) assigns different fairness scores to them. In (d), property 10 requires $m$ to assign symmetric penalties to all groups. In the given example, $r$ disadvantages $G_1$ in the strongest possible way, while $r'$ disadvantages $G_0$ in the strongest possible way. However, ED assigns asymmetric scores to the two rankings, thereby failing to satisfy property 10.
  • Figure 3: Properties for ranking subsets of a population. For the setting in which only a subset of the candidate population is ranked, we illustrate three properties using exemplary ranking scenarios in panels (a)-(c). In each scenario, a ranking candidate belongs to either a protected or a non-protected group. In (a) and (b), properties 11 and 12 compare two kinds of rankings of the same length $n=2N$, where $N$ varies. The first kind contains only a single protected candidate, who is ranked at position 1. In the second kind, 50% of the candidates are from the protected group, but all of them are placed in the latter half of the ranking. Property 11 requires that for sufficiently small $N$, a metric $m$ always assigns a higher score to the first kind of ranking. Conversely, property 12 requires that for sufficiently large $N$, a metric $m$ always assigns a higher score to the second kind of ranking. Our examples show that exposure ratio (ER) satisfies both properties. In (c), property 13 requires that appending a candidate of the non-protected group at the end of any ranking always results in lower fairness with respect to $m$. The given example shows that this does not hold for rND.
  • Figure 4: Property 8: invariance to ranking length. We show the behavior of fairness metrics for varying ranking length $n$. Last (blue) represents rankings in which all candidates from the protected group are ranked higher than each candidate from the non-protected group. Conversely, first (yellow) represents rankings in which all candidates from the protected group are ranked lower than each candidate from the non-protected group. The proportion of the protected group is fixed at $p_{G_1} = 0.3$, but qualitatively the results are the same for other values of $p_{G_1}$. If the markers of the different ranking types each remain at a constant value, the respective metric satisfies invariance to ranking length. We assume uniform relevance, therefore DTD, DID, DTR and DIR are equal to ED and ER, respectively. We observe that none of the shown metrics seem invariant to ranking length ($\textcolor{red}{\times}$). This implies that for all these metrics, the values obtained on rankings of different length are hardly comparable.
  • Figure 5: Property 9: invariance to group proportion, and property 10: symmetric penalties for all groups. We illustrate the behavior of fairness metrics for varying proportions of the protected group $p_{G_1}$. Last (blue) represents rankings in which all candidates from the protected group are ranked higher than each candidate from the non-protected group. Conversely, first (yellow) represents rankings in which all candidates from the protected group are ranked lower than each candidate from the non-protected group. The results are shown for a fixed ranking length of $n = 100$ but the results are qualitatively the same for other choices of $n$. Assuming uniform relevance, DTD, DID, DTR and DIR are equal to ED and ER, respectively. If the markers of the different ranking types each remain at a constant value, the respective metric satisfies invariance to group proportions. We can see that this is the case for none of the shown metrics ($\textcolor{red}{\times}$). This implies that these metrics should not be used to compare the fairness of rankings that differ in terms of group shares. With regard to property 10, a metric assigns symmetric penalties for all groups if the distance to the optimum is the same for the blue and yellow markers for a fixed group proportion. We observe that none of the shown metrics satisfy this property either.
  • ...and 2 more figures
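
The checks described in the captions for properties 7 and 8 can be probed numerically. The following is a minimal sketch only: the paper's exact definitions of ER and ED are not reproduced in this summary, so the code assumes a logarithmic position discount $1/\log_2(1+i)$ and defines ER and ED as the ratio and difference of mean group exposure. All function names are hypothetical, and the numbers it prints are not the values shown in the paper's figures.

```python
import math
from itertools import permutations

def position_weight(rank):
    """Logarithmic discount for a 1-indexed rank (assumed exposure model)."""
    return 1.0 / math.log2(1 + rank)

def mean_exposure(groups, group):
    """Average position weight received by the members of `group`."""
    weights = [position_weight(i)
               for i, g in enumerate(groups, start=1) if g == group]
    return sum(weights) / len(weights)

def exposure_ratio(groups):
    """Assumed ER: mean exposure of protected (1) over non-protected (0)."""
    return mean_exposure(groups, 1) / mean_exposure(groups, 0)

def exposure_difference(groups):
    """Assumed ED: mean exposure of protected (1) minus non-protected (0)."""
    return mean_exposure(groups, 1) - mean_exposure(groups, 0)

def extreme_rankings(n, p_protected):
    """Rankings of length n with the protected group entirely on top or at the bottom."""
    k = round(n * p_protected)
    protected_top = [1] * k + [0] * (n - k)
    protected_bottom = [0] * (n - k) + [1] * k
    return protected_top, protected_bottom

def expected_value(metric, groups):
    """Mean of `metric` under a uniformly random ranking.

    Every distinct ordering of group labels arises from the same number of
    candidate permutations, so averaging over distinct orderings suffices.
    """
    orderings = set(permutations(groups))
    return sum(metric(list(o)) for o in orderings) / len(orderings)

# Property 8 probe: does the value stay constant as the ranking grows?
for n in (10, 100, 1000):
    top, bottom = extreme_rankings(n, 0.3)
    print(n, round(exposure_ratio(top), 3), round(exposure_ratio(bottom), 3))

# Property 7 probe: expectation over all rankings of a two-candidate population.
print(expected_value(exposure_ratio, [1, 0]))
```

Under these assumptions, a metric that is invariant to ranking length would print the same value in every row of the loop, and a metric satisfying property 7 would print its optimal value in the last line. The brute-force enumeration in `expected_value` is only practical for very small populations.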

Theorems & Definitions (37)

  • Definition
  • Definition
  • Definition
  • Definition
  • Definition
  • Definition
  • Definition
  • Definition
  • Definition
  • Definition
  • ...and 27 more