Fairness Auditing with Multi-Agent Collaboration
Martijn de Vos, Akash Dhasade, Jade Garcia Bourrée, Anne-Marie Kermarrec, Erwan Le Merrer, Benoit Rottembourg, Gilles Tredan
TL;DR
The paper explores fairness auditing with multiple collaborating agents, each auditing a protected attribute under a fixed query budget. It analyzes how sampling methods (uniform, stratified, Neyman) interact with collaboration modes (no collaboration, a-posteriori, a-priori) to affect the accuracy of demographic parity (DP) estimates. Theoretical results show that collaboration generally improves audit accuracy, but extensive a-priori coordination with stratified sampling can be detrimental as the number of agents grows. Under a-posteriori collaboration, advanced sampling methods offer diminishing returns as agents are added, making uniform sampling effectively optimal at scale. Empirical validation on the Folktables, German Credit, and ProPublica datasets confirms the theory, demonstrating substantial DP error reductions through collaboration and clarifying when each strategy is advantageous. The work provides practical guidance for coordinating fairness audits in real-world ML systems and outlines avenues for extending the analysis to intersectional and active-sampling settings.
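To make the three sampling strategies concrete, the following is a minimal Python sketch of a single auditor estimating the DP gap of one binary protected attribute under a fixed query budget. Everything in it is a hypothetical illustration, not the paper's estimator or experimental setup: the synthetic population, its group rates, and all function names (`make_population`, `dp_gap`, `neyman_split`, `sample_split`, `mean_abs_error`) are invented here. Two further assumptions are labeled in the code: "stratified" is taken to mean an equal split of the budget across the two groups, and Neyman allocation splits the budget in proportion to each group's outcome standard deviation, using oracle knowledge of those deviations.

```python
import random
import statistics


def make_population(n=10_000, p_minority=0.2, rate_minority=0.3,
                    rate_majority=0.5, seed=0):
    """Hypothetical platform: (a, y) pairs, a = protected attribute, y = decision."""
    rng = random.Random(seed)
    pop = []
    for _ in range(n):
        a = int(rng.random() < p_minority)
        rate = rate_minority if a == 1 else rate_majority
        pop.append((a, int(rng.random() < rate)))
    return pop


def dp_gap(samples):
    """Demographic parity gap |P(y=1 | a=1) - P(y=1 | a=0)| over (a, y) pairs."""
    by_group = {0: [], 1: []}
    for a, y in samples:
        by_group[a].append(y)
    if not by_group[0] or not by_group[1]:
        raise ValueError("both groups must appear in the sample")
    return abs(statistics.fmean(by_group[1]) - statistics.fmean(by_group[0]))


def neyman_split(budget, std1, std0):
    """Split `budget` queries in proportion to each group's outcome std.

    This minimises std1**2 / n1 + std0**2 / n0 subject to n1 + n0 = budget.
    """
    total = std1 + std0
    n1 = budget // 2 if total == 0 else round(budget * std1 / total)
    n1 = min(max(n1, 1), budget - 1)  # keep at least one query per group
    return n1, budget - n1


def sample_split(pop, groups, n1, n0, rng):
    """Query n1 members of group a=1 and n0 of group a=0, without replacement."""
    return ([pop[i] for i in rng.sample(groups[1], n1)]
            + [pop[i] for i in rng.sample(groups[0], n0)])


def mean_abs_error(pop, budget=200, trials=300, seed=1):
    """Mean |estimated DP gap - true DP gap| for each sampling strategy."""
    rng = random.Random(seed)
    groups = {0: [], 1: []}
    for i, (a, _) in enumerate(pop):
        groups[a].append(i)
    true_gap = dp_gap(pop)
    # Idealisation: Neyman allocation uses the true per-group stds (an oracle);
    # a real auditor would have to estimate them, e.g. from a pilot sample.
    stds = {g: statistics.pstdev([pop[i][1] for i in idx])
            for g, idx in groups.items()}
    splits = {
        "stratified": (budget // 2, budget - budget // 2),  # assumed: equal split
        "neyman": neyman_split(budget, stds[1], stds[0]),
    }
    errors = {"uniform": [], "stratified": [], "neyman": []}
    for _ in range(trials):
        # Uniform: query `budget` individuals drawn from the whole population.
        uniform = [pop[i] for i in rng.sample(range(len(pop)), budget)]
        errors["uniform"].append(abs(dp_gap(uniform) - true_gap))
        for name, (n1, n0) in splits.items():
            estimate = dp_gap(sample_split(pop, groups, n1, n0, rng))
            errors[name].append(abs(estimate - true_gap))
    return {name: statistics.fmean(errs) for name, errs in errors.items()}


if __name__ == "__main__":
    print(mean_abs_error(make_population()))
```

The Neyman rule follows from minimising the variance of a difference of two independent group means, which is why groups with noisier outcomes receive more queries; in this toy population the minority group is both smaller and somewhat less variable, so stratified and Neyman splits differ only slightly.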
Abstract
Existing work in fairness auditing assumes that each audit is performed independently. In this paper, we consider multiple agents working together, each auditing the same platform for different tasks. Agents have two levers: their collaboration strategy, with or without coordination beforehand, and their strategy for sampling appropriate data points. We theoretically analyze how these two levers interact. Our main findings are that (i) collaboration is generally beneficial for accurate audits, (ii) basic sampling methods often prove to be effective, and (iii) counter-intuitively, extensive coordination on queries often degrades audit accuracy as the number of agents increases. Experiments on three large datasets confirm our theoretical results. Our findings motivate collaboration during fairness audits of platforms that use ML models for decision-making.
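The collaboration lever can be illustrated with a small simulation comparing no collaboration against a-posteriori sharing. The sketch below assumes that each agent audits its own binary protected attribute, queries the platform uniformly at random with its own budget, and, under a-posteriori collaboration, pools every agent's query answers after querying so that each agent estimates its DP gap from the combined set. The population model, the use of uniform sampling for every agent, and all function names (`make_population`, `dp_gap`, `audit_errors`) are hypothetical; a-priori coordination is not modeled.

```python
import random
import statistics


def make_population(n=10_000, n_attrs=3, p_protected=0.3, seed=0):
    """Hypothetical platform: (attrs, y) pairs with n_attrs binary protected attributes."""
    rng = random.Random(seed)
    pop = []
    for _ in range(n):
        attrs = tuple(int(rng.random() < p_protected) for _ in range(n_attrs))
        rate = 0.6 - 0.1 * sum(attrs)  # favourable rate drops with each attribute
        pop.append((attrs, int(rng.random() < rate)))
    return pop


def dp_gap(samples, k):
    """DP gap |P(y=1 | attr_k=1) - P(y=1 | attr_k=0)| over (attrs, y) pairs."""
    by_group = {0: [], 1: []}
    for attrs, y in samples:
        by_group[attrs[k]].append(y)
    if not by_group[0] or not by_group[1]:
        raise ValueError("both groups must appear in the sample")
    return abs(statistics.fmean(by_group[1]) - statistics.fmean(by_group[0]))


def audit_errors(pop, n_agents, budget=100, trials=200, seed=1):
    """Mean absolute DP error per agent: (no collaboration, a-posteriori sharing)."""
    if n_agents > len(pop[0][0]):
        raise ValueError("each agent needs its own protected attribute")
    rng = random.Random(seed)
    truth = [dp_gap(pop, k) for k in range(n_agents)]
    solo, pooled = [], []
    for _ in range(trials):
        # Each agent independently queries `budget` uniformly random individuals.
        own = [[pop[i] for i in rng.sample(range(len(pop)), budget)]
               for _ in range(n_agents)]
        # A-posteriori collaboration: answers are pooled after querying.
        # Individuals queried by several agents are kept once per agent.
        shared = [s for answers in own for s in answers]
        for k in range(n_agents):
            solo.append(abs(dp_gap(own[k], k) - truth[k]))
            pooled.append(abs(dp_gap(shared, k) - truth[k]))
    return statistics.fmean(solo), statistics.fmean(pooled)


if __name__ == "__main__":
    print(audit_errors(make_population(), n_agents=3))
```

In this toy setting, pooling uniform samples simply gives every agent a larger uniform sample of the same platform, which is one intuition for finding (i); with a single agent the two estimates coincide.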
