On Socially Fair Low-Rank Approximation and Column Subset Selection

Zhao Song, Ali Vakilian, David P. Woodruff, Samson Zhou

TL;DR

This work studies socially fair low-rank approximation and socially fair column subset selection, aiming to minimize the worst-case reconstruction loss across multiple demographic groups. It establishes strong hardness results showing that constant-factor fair LRA is intractable under standard complexity assumptions, while offering practical, scalable alternatives for a fixed number of groups via $2^{\mathrm{poly}(k)}$-time algorithms and poly-time bicriteria methods. The paper develops a suite of techniques (affine embeddings, leverage-score/Lewis-weight sampling, and Dvoretzky-type embeddings) to achieve near-optimal fair reconstructions and to select informative column subsets in a fairness-aware manner, with rigorous guarantees. Empirical evaluations on real (credit-card) and synthetic data validate the effectiveness of the proposed bicriteria algorithms, demonstrating improved fairness-sensitive objective values and favorable runtimes compared with traditional non-fair baselines. Overall, the results provide both theoretical limits and practical tools for integrating fairness into core linear-algebra tasks used in machine learning and data analysis.

Abstract

Low-rank approximation and column subset selection are two fundamental and related problems that are applied across a wealth of machine learning applications. In this paper, we study the question of socially fair low-rank approximation and socially fair column subset selection, where the goal is to minimize the loss over all sub-populations of the data. We show that, surprisingly, even a constant-factor approximation to fair low-rank approximation requires exponential time under certain standard complexity hypotheses. On the positive side, we give an algorithm for fair low-rank approximation that, for a constant number of groups and constant-factor accuracy, runs in $2^{\text{poly}(k)}$ time rather than the naïve $n^{\text{poly}(k)}$, which is a substantial improvement when the dataset has a large number $n$ of observations. We then show that there exist bicriteria approximation algorithms for fair low-rank approximation and fair column subset selection that run in polynomial time.
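
To make the objective concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the socially fair cost of a rank-$k$ projection $V$ is the maximum over groups of the Frobenius reconstruction error $\|A^{(i)} - A^{(i)} V V^\top\|_F$, and it only evaluates that cost for the standard (non-fair) SVD solution on pooled data. The function names, group sizes, and synthetic data are hypothetical choices for illustration.

```python
import numpy as np


def fair_cost(groups, V):
    """Assumed fair objective: worst reconstruction error over all groups.

    groups: list of (n_i x d) arrays, one per sub-population.
    V: (d x k) matrix with orthonormal columns spanning the candidate subspace.
    """
    return max(np.linalg.norm(A - A @ V @ V.T, ord="fro") for A in groups)


def top_k_right_singular_vectors(A, k):
    """Standard low-rank approximation: top-k right singular vectors of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T


rng = np.random.default_rng(0)
d, k = 6, 2

# Two hypothetical groups whose variance lies along different coordinates.
# The first group is much larger, so it dominates the pooled SVD.
A1 = rng.normal(size=(200, d)) * np.array([5.0, 4.0, 1.0, 1.0, 1.0, 1.0])
A2 = rng.normal(size=(20, d)) * np.array([1.0, 1.0, 5.0, 4.0, 1.0, 1.0])
groups = [A1, A2]

# Non-fair baseline: fit one subspace to the stacked data.
V_std = top_k_right_singular_vectors(np.vstack(groups), k)

# Per-group errors show which sub-population determines the fair cost.
per_group = [np.linalg.norm(A - A @ V_std @ V_std.T, ord="fro") for A in groups]
print("per-group errors:", per_group)
print("fair cost of the standard solution:", fair_cost(groups, V_std))
```

A fair algorithm would search for the $V$ that minimizes `fair_cost` directly; the baseline above minimizes only the pooled error, so the worst-off group's error can be far from the best achievable.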

Paper Structure

This paper contains 30 sections, 34 theorems, 87 equations, 2 figures, 6 algorithms.

Key Result

Theorem 1.1

Fair low-rank approximation is NP-hard to approximate within any constant factor.

Figures (2)

  • Figure 1: Empirical evaluations on the Default Credit dataset.
  • Figure 2: Ratio of the cost of our bicriteria algorithm to the cost of the standard low-rank approximation solution for $k=2$, across 100 iterations.

Theorems & Definitions (53)

  • Theorem 1.1
  • Theorem 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Theorem 1.5
  • Definition 1.6: Subspace embedding
  • Definition 1.7
  • Theorem 1.8: Generalization of Foster's Theorem [foster1953stochastic]
  • Theorem 1.9: Leverage score sampling
  • Lemma 1.9
  • ...and 43 more