Access Denied: Meaningful Data Access for Quantitative Algorithm Audits

Juliette Zaccour, Reuben Binns, Luc Rocher

TL;DR

Even for one of the simplest tasks in algorithmic auditing, data minimization and anonymization practices can strongly increase error rates on individual-level data, leading to unreliable assessments.

Abstract

Independent algorithm audits hold the promise of bringing accountability to automated decision-making. However, third-party audits are often hindered by access restrictions, forcing auditors to rely on limited, low-quality data. To study how these limitations impact research integrity, we conduct audit simulations on two realistic case studies for recidivism and healthcare coverage prediction. We examine the accuracy of estimating group parity metrics across three levels of access: (a) aggregated statistics, (b) individual-level data with model outputs, and (c) individual-level data without model outputs. Despite selecting one of the simplest tasks for algorithmic auditing, we find that data minimization and anonymization practices can strongly increase error rates on individual-level data, leading to unreliable assessments. We discuss implications for independent auditors, as well as potential avenues for HCI researchers and regulators to improve data access and enable both reliable and holistic evaluations.
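
To make the notion of a group parity metric concrete, here is a minimal sketch in Python. It assumes the metric is a demographic parity difference (the gap in positive-prediction rates between groups); the abstract does not say which parity metrics the paper uses, so this choice, the function names, and the toy data are all illustrative assumptions rather than the authors' method. The sketch computes the metric once from individual-level records with model outputs and once from per-group aggregated counts.

```python
from collections import defaultdict

def selection_rates(groups, predictions):
    """Share of positive predictions per group, from individual-level records."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, p in zip(groups, predictions):
        totals[g] += 1
        positives[g] += int(p == 1)
    return {g: positives[g] / totals[g] for g in totals}

def rates_from_aggregates(counts):
    """Same rates, from aggregated counts: group -> (n_positive, n_total)."""
    return {g: pos / tot for g, (pos, tot) in counts.items()}

def parity_difference(rates):
    """Largest gap in selection rate between any two groups (assumed metric)."""
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: two groups, binary model outputs.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

individual = selection_rates(groups, predictions)
aggregate = rates_from_aggregates({"a": (3, 4), "b": (1, 4)})

print(parity_difference(individual))
print(parity_difference(aggregate))
```

With clean, complete data the two routes agree. The access restrictions studied in the paper (noise added to aggregates, reduced samples, missing features or values) would perturb the inputs to either function.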

Paper Structure

This paper contains 42 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Experimental design flowchart, distinguishing between the simulated organization (in grey), baseline audit (in blue), and audits under Access Scenarios A, B and C (in green).
  • Figure 2: Effect of differential privacy on metric reliability with full sample size and with a reduced ($n=1,000$) sample size. The red and yellow lines represent median values for the baseline audit, while the colored areas represent 95% confidence intervals obtained through bootstrapping and repetitions over dataset splits. Box plots represent metric estimations obtained from the DP aggregates at given privacy budgets, with bootstrapping and repetitions over dataset splits. In the box plots, middle lines represent medians.
  • Figure 3: Effect of sample size on metric reliability for Access B (left) and Access C (right). The red and yellow lines represent median values for baselines. The blue and green dots represent median values for experiments, from 100% to 1% of the available sample in each case. Note that there are fewer data points on the Access C plots, as only 30% of the full audit dataset is available under this scenario (the other 70% being used to retrain the model).
  • Figure 4: Effect of missing features on metric reliability for Access B (left) and Access C (right). On the x-axis, features are ordered by increasing importance. Plots are cumulative (e.g. at the F14 point, features 14 to 18 are missing from the audit dataset). The red and yellow lines represent median values for baselines, while the blue and green dots represent median values for experiments. The error bounds represent 95% confidence intervals obtained through bootstrapping and repetitions over dataset splits.
  • Figure 5: Effect of disparate missing-value rates on metric reliability for Access B (left) and Access C (right). The red and yellow lines represent median values for baselines, while the blue and green dots represent median values for experiments. The colored areas represent 95% confidence intervals obtained through bootstrapping and repetitions over dataset splits.
  • ...and 7 more figures