Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances
German Martinez Matilla, Jakub Marecek
TL;DR
This work tackles the problem of bias detection under sample complexity constraints by reframing bias estimation as a point-to-subspace problem in the space of measures and proposing a subsampling scheme that operates efficiently under the supremum norm. It provides PAC guarantees for recovering whether a test measure lies in a reference subspace, with a VC-dimension-based sample bound that scales polylogarithmically with the number of histogram bins. Empirically, the approach outperforms Wasserstein-based methods in high dimensions and demonstrates robust bias assessment on real-world datasets (Adult and folktables) with protected attribute SEX. The results offer a practically applicable auditing tool for data quality and AI systems, with tunable error and confidence via parameters $(\varepsilon, \delta)$ and potential extensions to broader function classes.
Abstract
The sample complexity of bias estimation is a lower bound on the runtime of any bias detection method. Many regulatory frameworks require bias to be tested for all subgroups, whose number grows exponentially with the number of protected attributes. To avoid a doubly exponential overall runtime, bias detection for a single subgroup should have polynomial complexity. At the same time, the reference data may be based on surveys and thus come with non-trivial uncertainty. Here, we reformulate bias detection as a point-to-subspace problem on the space of measures and show that, for the supremum norm, it can be subsampled efficiently. Our probably approximately correct (PAC) results are corroborated by tests on well-known instances.
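To make the supremum-norm comparison concrete, the following minimal sketch treats the degenerate case in which the reference subspace collapses to a single reference histogram. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the bin labels, and the fixed subsample size m are all hypothetical, and the paper's actual sample bound and point-to-subspace minimization are not reproduced here.

```python
import random
from collections import Counter


def histogram(samples, bins):
    """Empirical probability of each bin label (0.0 for bins never observed)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[b] / n for b in bins]


def sup_distance(p, q):
    """Supremum-norm distance between two histograms over the same bins."""
    return max(abs(a - b) for a, b in zip(p, q))


def subsampled_sup_distance(data, reference, bins, m, rng):
    """Estimate the sup-norm distance to `reference` from m records drawn
    without replacement. ASSUMPTION: the reference is a single histogram,
    i.e. a degenerate stand-in for the reference subspace."""
    subsample = rng.sample(data, m)
    return sup_distance(histogram(subsample, bins), reference)


# Hypothetical data: one protected attribute with two values.
bins = ["F", "M"]
data = ["F"] * 60 + ["M"] * 40
reference = [0.5, 0.5]

rng = random.Random(0)
estimate = subsampled_sup_distance(data, reference, bins, m=50, rng=rng)
full = subsampled_sup_distance(data, reference, bins, m=len(data), rng=rng)
```

Here `full` uses every record and so equals the exact distance between the data histogram and the reference, while `estimate` is the subsampled approximation. An auditor would compare the estimate against a chosen tolerance $\varepsilon$, and the subsample size would in practice be set from the desired $(\varepsilon, \delta)$ rather than fixed by hand.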
