Fair Column Subset Selection

Antonis Matakos, Bruno Ordozgoiti, Suhas Thejaswi

TL;DR

FairCSS is the problem of selecting a single subset of columns to represent two groups of rows, minimizing the larger of the two groups' reconstruction errors, each measured relative to that group's best rank-$k$ approximation. The authors adapt leverage-score sampling to the fair setting, prove that finding a minimum-size solution is NP-hard, and give an approximation algorithm whose solution uses at most 1.5 times the optimal number of columns, along with scalable heuristics based on rank-revealing QR factorization. Experiments on real-world datasets show that the proposed methods achieve near-fair reconstruction with limited sacrifice to overall accuracy, and a two-stage approach further improves practicality. The work contributes provable guarantees and scalable tools for fair feature selection; future work aims to extend the results to more than two groups and to sharpen the approximation ratio.

Abstract

The problem of column subset selection asks for a subset of columns from an input matrix such that the matrix can be reconstructed as accurately as possible within the span of the selected columns. A natural extension is to consider a setting where the matrix rows are partitioned into two groups, and the goal is to choose a subset of columns that minimizes the maximum reconstruction error of both groups, relative to their respective best rank-$k$ approximation. Extending the known results of column subset selection to this fair setting is not straightforward: in certain scenarios it is unavoidable to choose columns separately for each group, resulting in double the expected column count. We propose a deterministic leverage-score sampling strategy for the fair setting and show that sampling a column subset of minimum size becomes NP-hard in the presence of two groups. Despite these negative results, we give an approximation algorithm that guarantees a solution within 1.5 times the optimal solution size. We also present practical heuristic algorithms based on rank-revealing QR factorization. Finally, we validate our methods through an extensive set of experiments using real-world data.
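The objective described in the abstract can be made concrete with a small sketch. The code below evaluates, for a fixed set of column indices shared by two row groups, each group's reconstruction error relative to its best rank-$k$ approximation and returns the larger of the two. The function names, the use of unsquared Frobenius norms, and the random test matrices are assumptions made for illustration; the paper's exact normalization may differ.

```python
import numpy as np

def best_rank_k_error(X, k):
    """Frobenius error of the best rank-k approximation of X (Eckart-Young)."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))

def projection_error(X, cols):
    """Frobenius error after projecting X onto the span of its columns `cols`."""
    C = X[:, cols]
    # C @ pinv(C) is the orthogonal projector onto the column space of C.
    return np.linalg.norm(X - C @ np.linalg.pinv(C) @ X)

def min_max_loss(A, B, cols, k):
    """Larger of the two groups' relative reconstruction errors (assumed form)."""
    loss_a = projection_error(A, cols) / best_rank_k_error(A, k)
    loss_b = projection_error(B, cols) / best_rank_k_error(B, k)
    return max(loss_a, loss_b)

# Hypothetical data: two row groups over the same 8 columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
B = rng.standard_normal((20, 8))

loss = min_max_loss(A, B, cols=[0, 3, 5], k=3)
full = min_max_loss(A, B, cols=list(range(8)), k=3)
```

With exactly $k$ columns the loss cannot drop below 1, because a projection onto $k$ columns is itself a rank-$k$ approximation; selecting every column drives it to zero.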

Paper Structure (11 sections, 3 theorems, 17 equations, 4 figures, 2 tables, 3 algorithms)

Key Result

Theorem 1

Let $M \in \mathbb{R}^{m \times n}$ be a matrix and $k<\text{rank}(M)$ an integer. Let $\theta=k-\epsilon$ for some $\epsilon \in (0,1)$ and $S$ be a subset of column indices such that $\sum_{i \in S} \ell_i^{(k)} \geq \theta$, and $C\in \mathbb{R}^{m \times k}$ be the matrix of $M$ forme
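The threshold condition $\sum_{i \in S} \ell_i^{(k)} \geq \theta$ in Theorem 1 can be illustrated for a single matrix. The sketch below computes rank-$k$ leverage scores as the squared column norms of the top-$k$ right singular vectors, then keeps the shortest prefix of columns, sorted by decreasing score, whose scores sum to at least $\theta = k - \epsilon$. The function names and the sorted-prefix selection rule are assumptions for illustration; this is the single-group case only and not the paper's fair variant.

```python
import numpy as np

def rank_k_leverage_scores(M, k):
    """Squared column norms of the top-k right singular vectors of M."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0)

def select_by_threshold(scores, theta):
    """Indices of the fewest highest-score columns whose scores sum to >= theta."""
    order = np.argsort(scores)[::-1]          # columns by decreasing score
    cumulative = np.cumsum(scores[order])
    # First position where the running sum reaches theta, as a column count.
    count = int(np.searchsorted(cumulative, theta, side="left")) + 1
    return order[:count]

# Hypothetical input matrix and parameters.
rng = np.random.default_rng(1)
M = rng.standard_normal((40, 10))
k, eps = 3, 0.5
theta = k - eps

scores = rank_k_leverage_scores(M, k)
S = select_by_threshold(scores, theta)
```

Because the rank-$k$ leverage scores sum to $k$ and each is at most 1, any set meeting the threshold contains at least $k$ columns.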

Figures (4)

  • Figure 1: MinMaxLoss for different values of $c$ and fixed target rank $k$
  • Figure 2: Comparison of reconstruction error of CSS and FairCSS for groups $A$ and $B$.
  • Figure 3: Price of fairness.
  • Figure 4: Leverage scores of $A$ and $B$ for Table \ref{table:performance}

Theorems & Definitions (8)

  • Definition 1: Relative group-wise reconstruction error
  • Definition 2: Leverage scores
  • Theorem 1 (Papailiopoulos et al., 2014)
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Definition 3: QR decomposition with column pivoting
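As a rough illustration of Definition 3, the sketch below selects columns by greedy column pivoting: at each step it picks the column with the largest residual norm and removes its direction from all remaining columns. In exact arithmetic this is the pivot order that QR with column pivoting would follow. The function name is hypothetical, the routine assumes the requested number of columns does not exceed the rank of the matrix, and it is not the paper's fair heuristic. In practice `scipy.linalg.qr(M, pivoting=True)` returns the pivot order as its third output.

```python
import numpy as np

def pivoted_qr_columns(M, c):
    """Indices of c columns chosen by greedy column pivoting (assumes c <= rank(M))."""
    R = np.array(M, dtype=float)      # working copy, deflated step by step
    chosen = []
    for _ in range(c):
        norms = np.sum(R ** 2, axis=0)
        if chosen:
            norms[chosen] = -1.0      # never pick the same column twice
        j = int(np.argmax(norms))     # pivot: largest residual norm
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)       # remove the component along q from every column
        chosen.append(j)
    return chosen

# Small check on a diagonal matrix, where pivots follow the diagonal magnitudes.
picked = pivoted_qr_columns(np.diag([1.0, 5.0, 3.0, 2.0]), 3)
```

On the diagonal example the pivots are taken in order of decreasing diagonal magnitude, so columns 1, 2, and 3 are selected in that order.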