Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances

German Martinez Matilla, Jakub Marecek

TL;DR

This work tackles the problem of bias detection under sample complexity constraints by reframing bias estimation as a point-to-subspace problem in the space of measures and proposing a subsampling scheme that operates efficiently under the supremum norm. It provides PAC guarantees for recovering whether a test measure lies in a reference subspace, with a VC-dimension-based sample bound that scales polylogarithmically with the number of histogram bins. Empirically, the approach outperforms Wasserstein-based methods in high dimensions and demonstrates robust bias assessment on real-world datasets (Adult and folktables) with protected attribute SEX. The results offer a practically applicable auditing tool for data quality and AI systems, with tunable error and confidence via parameters $(\varepsilon, \delta)$ and potential extensions to broader function classes.

Abstract

The sample complexity of bias estimation is a lower bound on the runtime of any bias detection method. Many regulatory frameworks require bias to be tested for all subgroups, whose number grows exponentially with the number of protected attributes. To avoid a doubly-exponential runtime for bias detection overall, the complexity of bias detection for a single subgroup should be polynomial. At the same time, the reference data may be based on surveys, and thus come with non-trivial uncertainty. Here, we reformulate bias detection as a point-to-subspace problem on the space of measures and show that, for the supremum norm, it can be subsampled efficiently. In particular, our probabilistically approximately correct (PAC) results are corroborated by tests on well-known instances.
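
The subsampling idea in the abstract can be illustrated with a minimal sketch. It assumes that both measures are histograms stored as NumPy arrays and that the reference subspace reduces to a per-bin interval $a_i \pm \Delta$. The function names, the uniform sampling of bins, and the toy data are hypothetical choices made for illustration; this is not the authors' algorithm, and the sample size `m` is picked by hand rather than derived from the paper's bound.

```python
import numpy as np

def violations(alpha0, a, delta):
    """Per-bin excess of |alpha0 - a| over the tolerance delta (0 when within tolerance)."""
    return np.maximum(np.abs(alpha0 - a) - delta, 0.0)

def subsampled_check(alpha0, a, delta, m, rng):
    """Inspect m bins drawn uniformly without replacement.

    Returns True when no inspected bin violates the tolerance, i.e. the test
    histogram looks consistent with the reference on the sampled bins.
    """
    n_bins = alpha0.shape[0]
    idx = rng.choice(n_bins, size=min(m, n_bins), replace=False)
    return bool(np.all(violations(alpha0[idx], a[idx], delta) == 0.0))

# Toy data (assumed, not taken from the paper).
rng = np.random.default_rng(0)
n_bins = 1000
a = np.full(n_bins, 1.0 / n_bins)      # hypothetical uniform reference histogram
delta = 0.1 / n_bins                   # hypothetical per-bin uncertainty

alpha_ok = a.copy()                    # test measure equal to the reference
alpha_bad = a.copy()
alpha_bad[:500] = 1.5 / n_bins         # mass shifted into the first half of the bins
alpha_bad[500:] = 0.5 / n_bins

consistent_ok = subsampled_check(alpha_ok, a, delta, m=50, rng=rng)
consistent_bad = subsampled_check(alpha_bad, a, delta, m=50, rng=rng)
sup_distance_bad = float(np.max(np.abs(alpha_bad - a)))  # exact supremum-norm gap
```

In this toy case every bin of `alpha_bad` exceeds the tolerance, so any subsample detects the deviation. When only a small fraction of bins violate the tolerance, a subsample can miss them, which is the one-sided error that the PAC parameters $(\varepsilon, \delta)$ are meant to control.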

Paper Structure

This paper contains 12 sections, 1 theorem, 27 equations, 6 figures, 2 algorithms.

Key Result

Theorem 8

Consider a test measure $\alpha^0\in\mathcal{M}^1_+(\mathbb{R}^n)$ and a subspace $V \in \mathcal{M}^1_+(\mathbb{R}^n)$ such that $\ell_0(v_{hist}(\alpha)) \le \epsilon N$, where $v_{hist}(\alpha)$ is the vector of violations of the constraint $v(\alpha) := \{ \max\{ |\alpha^0(x_i)-a_i| - \Delta, \ldots$ [...] produces a false positive, that is, reports that the test measure $\alpha^0$ is in $S \subset \ldots$ [...]

Figures (6)

  • Figure 1: Whole COMPAS dataset by decile_score
  • Figure 2: Whole population approximation by decile_score
  • Figure 3: Test measure, $\alpha_0$
  • Figure 4: decile_score and age relative frequencies with error bars of length $\Delta$
  • Figure 5: Probability of one-sided error for Wasserstein-2 and point-to-subspace distance in the supremum norm as a function of the sample size on the Adult dataset misc_adult_2.
  • ...and 1 more figure

Theorems & Definitions (9)

  • Definition 1: Reproducing kernel
  • Definition 2: MMD in RKHS
  • Example 3
  • Definition 4
  • Definition 5: MR3408730
  • Definition 6: MR3408730
  • Definition 7: MR3408730
  • Theorem 8
  • proof