Finding a Fair Scoring Function for Top-$k$ Selection: From Hardness to Practice

Guangya Cai

TL;DR

This work studies fair top-$k$ selection under a linear scoring model, proving conditional hardness results that hinder universal scalable solutions in higher dimensions and motivating a two-pronged algorithmic strategy. For small $k$, a $k$-level-based method leverages duality and computational geometry to achieve practical speedups, while for large $k$, a MILP-based approach offers robust performance despite worst-case NP-hardness. Experimental evaluations on real datasets (e.g., COMPAS and IIT-JEE) show orders-of-magnitude improvements over state-of-the-art baselines and provide guidance on algorithm selection based on dimensionality and $k$. The study integrates hardness analysis, algorithm design, engineering optimization, and empirical evaluation to deliver a practically efficient solution with broad implications for fairness-aware decision-making systems.

Abstract

Selecting a subset of the $k$ "best" items from a dataset of $n$ items, based on a scoring function, is a key task in decision-making. Given the rise of automated decision-making software, it is important that the outcome of this process, called top-$k$ selection, is fair. Here we consider the problem of identifying a fair linear scoring function for top-$k$ selection. The function computes a score for each item as a weighted sum of its (numerical) attribute values, and must ensure that the selected subset includes adequate representation of a minority or historically disadvantaged group. Existing algorithms do not scale efficiently, particularly in higher dimensions. Our hardness analysis shows that in more than two dimensions, no algorithm is likely to achieve good scalability with respect to dataset size, and the computational complexity is likely to increase rapidly with dimensionality. However, the hardness results also provide key insights guiding algorithm design, leading to our two-pronged solution: (1) For small values of $k$, our hardness analysis reveals a gap in the hardness barrier. By addressing various engineering challenges, including achieving efficient parallelism, we turn this potential for efficiency into an optimized algorithm delivering substantial practical performance gains. (2) For large values of $k$, where the hardness is robust, we employ a practically efficient algorithm which, despite being theoretically worse, achieves superior real-world performance. Experimental evaluations on real-world datasets then explore scenarios where worst-case behavior does not manifest, identifying areas critical to practical performance. Our solution achieves speed-ups of up to several orders of magnitude compared to the state of the art, an efficiency made possible through a tight integration of hardness analysis, algorithm design, practical engineering, and empirical evaluation.
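To make the problem statement concrete, the following is a minimal sketch of the check that any candidate weight vector must pass: score each item as a weighted sum, take the $k$ highest-scoring items, and count members of the protected group. The names `top_k`, `is_fair`, and `min_protected`, the lower-bound form of the fairness constraint, and the index-based tie-breaking are illustrative assumptions, not the paper's definitions.

```python
from typing import List, Sequence


def top_k(items: Sequence[Sequence[float]], weights: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring items under a linear score."""
    scores = [sum(w * a for w, a in zip(weights, attrs)) for attrs in items]
    # Descending score; ties broken by index (an assumption for illustration).
    order = sorted(range(len(items)), key=lambda i: (-scores[i], i))
    return order[:k]


def is_fair(selected: Sequence[int], protected: Sequence[bool], min_protected: int) -> bool:
    """Assumed fairness test: at least `min_protected` selected items are protected."""
    return sum(1 for i in selected if protected[i]) >= min_protected


# Toy data (hypothetical): two attributes per item, last two items protected.
items = [(0.9, 0.2), (0.8, 0.3), (0.3, 0.9), (0.2, 0.8)]
protected = [False, False, True, True]

unfair_sel = top_k(items, (1.0, 0.0), k=2)  # weights only the first attribute
fair_sel = top_k(items, (0.4, 0.6), k=2)    # shifts weight to the second attribute
```

Verifying a given weight vector is cheap; the difficulty studied here lies in searching the space of weight vectors for one that passes the check.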

Paper Structure

This paper contains 24 sections, 8 theorems, 7 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Fair Top-$k$ Selection is 3SUM-Hard for any $d\geq 3$.
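For context, 3SUM asks whether a set of $n$ numbers contains three that sum to zero. It is widely conjectured that no algorithm solves it in truly subquadratic time, which is why 3SUM-hardness is read as evidence against scalable algorithms. The sketch below is the classic quadratic-time method (sort, then scan with two pointers); it illustrates the source problem only and is not part of the paper's reduction. The function name is hypothetical.

```python
from typing import Sequence


def has_three_sum(nums: Sequence[int]) -> bool:
    """Return True if three entries at distinct positions sum to zero."""
    a = sorted(nums)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        # Two-pointer scan over the sorted suffix for a pair summing to -a[i].
        while lo < hi:
            s = a[i] + a[lo] + a[hi]
            if s == 0:
                return True
            if s < 0:
                lo += 1
            else:
                hi -= 1
    return False
```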

Figures (8)

  • Figure 1: The structure and interplay of key results and components in this work. Solid arrows denote the primary workflow and direct influences, while dashed arrows indicate feedback. Dashed boxes group related components presented in the same subsection.
  • Figure 2: Dual transformation in 2-D, mapping a point $p_i$ to a line $l_i$. Each (open) line segment is a cell in 2-D and bold line segments in (b) show the $k$-level for $k = 1$ in 2-D.
  • Figure 3: Maximum kinetic tournament tree (right) for lines (left) at $x = t$. Additional auxiliary information stored at internal nodes is omitted. For an internal node, the associated line (e.g., $l_1$ at the root) is the Top() for the (sub)tree rooted at that node at the given time (value of $x$), and the number is the closest future time instant, over all nodes of the (sub)tree, at which the Top() of a node's left child intersects the Top() of its right child. Replace() can be implemented by substituting the target line (i.e., $l_1$ in this example) at the leaf level and propagating updates along the path to the root.
  • Figure 4: (a) Three lines, $l_1 : y = m_1x + b_1$, $l_2 : y = m_2x + b_2$ and $l_3 : y = m_3x + b_3$ intersect at $x = t$. The tie is broken by perturbing $x$ to $x = t + \epsilon$, where $\epsilon$ is infinitesimally small and positive. This, in fact, sorts the lines by values of $m_i$, and orders $l_3$ ahead of $l_1$ and $l_2$ for a max-queue. (b) For $k = 3$, we have $l_1, l_2 \in S_1$ (red) and $l_3, l_4, l_5 \in S_2$ (blue) before the intersection. At $x = t + \epsilon$, we should have $l_5, l_4 \in S_1$ and $l_3, l_2, l_1 \in S_2$. Since $S_1.$Top() is $l_2$ and $S_2.$Top() is $l_3$ before the intersection, the algorithm finds $l_1$ in $S_1$ and $l_5$ in $S_2$ by repeatedly calling Advance(). The two lines are then exchanged. Now, we have $S_1.$Top() is $l_2$ and $S_2.$Top() is $l_4$ and the pair should also be exchanged.
  • Figure 5: $(k-1)$-level (triangular mesh) and $V$ (dashed region on the $x$-$y$ plane) in 3-D, as each weight vector $w$ corresponds to a point $(w_1, w_2)$ on the $x$-$y$ plane. Testing cell intersection is equivalent to finding a point in $V$ whose downward-directed ray from $(w_1, w_2, +\infty)$ (see the section on small $k$) hits the cell (triangle). The constraints (points in $V$) can be incorporated into the linear program.
  • ...and 3 more figures
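
The 2-D dual view in Figure 2 and the tie-breaking rule in Figure 4(a) can be sketched as follows. The sketch assumes one common form of the dual transformation, mapping a point $(a, b)$ to the line $y = ax + b$, so that the height of the line at $x$ plays the role of the item's score; the paper's exact transformation may differ. The helper names `dual_line` and `order_at` are hypothetical. Ties at an intersection $x = t$ are resolved as if evaluating at $x = t + \epsilon$, which amounts to ordering tied lines by decreasing slope.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

Line = Tuple[Fraction, Fraction]  # (slope m, intercept b) for y = m*x + b


def dual_line(point: Sequence[float]) -> Line:
    """Assumed dual transform: point (a, b) maps to the line y = a*x + b."""
    a, b = point
    return (Fraction(a), Fraction(b))


def order_at(lines: Sequence[Line], x) -> List[int]:
    """Indices of lines from highest to lowest at x, ties broken as at x + epsilon."""
    x = Fraction(x)
    return sorted(
        range(len(lines)),
        # Higher value first; among equal values, the larger slope is higher just after x.
        key=lambda i: (-(lines[i][0] * x + lines[i][1]), -lines[i][0], i),
    )


# Three lines meeting at x = 1: y = 1, y = x, y = 2x - 1.
lines = [dual_line(p) for p in [(0, 1), (1, 0), (2, -1)]]
before = order_at(lines, 0)
at_tie = order_at(lines, 1)
```

Under this assumed transform, the first $k$ entries of `order_at(lines, x)` are the top-$k$ items at position $x$. Exact rational arithmetic via `Fraction` is used so that the three-way tie is detected reliably rather than being blurred by floating-point rounding.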

Theorems & Definitions (16)

  • Example 1
  • Definition 1
  • Theorem 1: 3SUM-hardness
  • Remark 1
  • Corollary 1: Curse of Dimensionality
  • Remark 2
  • Theorem 2: $k$-Generalization
  • Remark 3
  • Theorem 3: Group-balancing
  • Corollary 2: Constraint Relaxation
  • ...and 6 more