On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms

Surbhi Mittal, Kartik Thakral, Richa Singh, Mayank Vatsa, Tamar Glaser, Cristian Canton Ferrer, Tal Hassner

TL;DR

This study introduces a data-centric framework for Responsible Machine Learning Datasets that evaluates datasets along fairness, privacy, and regulatory compliance. By quantifying these axes with explicit metrics and applying them to 60 datasets (primarily biometric faces and chest X-rays), the authors reveal widespread deficiencies across all three dimensions and advocate for enhanced dataset documentation through modified datasheets. The work highlights a fairness–privacy paradox and demonstrates that regulatory compliance is often the weakest link in dataset quality, underscoring the need for governance-aware data collection practices. The proposed approach provides a North Star for constructing responsible datasets, enabling researchers and policymakers to improve data stewardship and, consequently, AI trustworthiness in practice.

Abstract

Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data, learning the behavioral objective. Any flaws in the data have the potential to translate directly into algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate the datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to the "datasheets for datasets" with important additions for improved dataset documentation. With governments around the world regularizing data protection laws, the method for the creation of datasets in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.

Paper Structure

This paper contains 9 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: We introduce the concept of Responsible Machine Learning Datasets and propose a quantitative rubric along with recommendations for future datasets.
  • Figure 2: (Top) The three aspects involved in fairness quantification (Inclusivity, Diversity, and Labels) and the questions they answer. (Bottom) The formulation employed for the calculation of the fairness score.
  • Figure 3: Privacy leakage through the information available in datasets. The sample is representative of information present in datasets such as the LFW dataset (huang2008labeled; liu2015deep).
  • Figure 4: The summary of fairness, privacy, and regulatory compliance scores through histogram visualization for the datasets we surveyed. (Left) The maximum value of the fairness score that can be obtained is 5, but it is observed that the fairness scores do not exceed a value of 3. (Middle) While most datasets in our study preserve privacy in terms of not leaking location or medical information, very few provide perfect privacy preservation. (Right) Most datasets comply with no regulatory norm or only one. We can observe from this plot that most datasets provide a low fairness score and perform poorly on the regulatory compliance metric.
  • Figure 5: Cluster analysis based on the 3-tuple quantification of fairness, privacy, and regulatory compliance for (a-b) only face-based datasets and (c-d) jointly with medical datasets. (a, c) The 3-D scatter plot of the different datasets across the three axes, with the FPR dataset plotted at perfect fairness, privacy preservation, and regulatory compliance. (b, d) The scatter plot after performing DBSCAN clustering with $\mathrm{eps}=1$. We observe that the FB Fairness Dataset and the UTKFace dataset lie closest to the FPR dataset.
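
The analysis described in the Figure 5 caption can be illustrated with a minimal sketch: each dataset is a point (fairness, privacy, regulatory), the points are clustered with DBSCAN at eps = 1, and datasets are ranked by Euclidean distance to an ideal "FPR" point. This is not the authors' code. The dataset names and score values are hypothetical, the maxima for the privacy and regulatory axes (6 and 3) are assumptions (only the fairness maximum of 5 appears in the captions), and `min_samples=2` is an assumed setting the captions do not state. DBSCAN is written in plain Python here so the sketch has no dependencies.

```python
import math

# Hypothetical (fairness, privacy, regulatory) tuples; NOT values from the paper.
# The ideal point assumes maxima of 5, 6, and 3; only the 5 is stated in the captions.
scores = {
    "ideal_FPR": (5.0, 6.0, 3.0),
    "dataset_A": (2.5, 5.0, 2.0),
    "dataset_B": (2.0, 5.0, 2.0),
    "dataset_C": (1.0, 2.0, 0.0),
    "dataset_D": (1.5, 2.0, 0.0),
    "dataset_E": (0.5, 4.0, 1.0),
}
names = list(scores)
points = list(scores.values())


def region_query(pts, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(pts) if math.dist(pts[i], q) <= eps]


def dbscan(pts, eps=1.0, min_samples=2):
    """Plain DBSCAN: returns one label per point, with -1 marking noise."""
    labels = [None] * len(pts)  # None means "not visited yet"
    cluster = -1
    for i in range(len(pts)):
        if labels[i] is not None:
            continue
        neighbors = region_query(pts, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = region_query(pts, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


labels = dbscan(points, eps=1.0, min_samples=2)

# Rank the real datasets by Euclidean distance to the ideal point.
ideal = scores["ideal_FPR"]
candidates = [n for n in names if n != "ideal_FPR"]
nearest = min(candidates, key=lambda n: math.dist(scores[n], ideal))

for name, label in zip(names, labels):
    print(f"{name}: cluster {label}, distance to ideal {math.dist(scores[name], ideal):.2f}")
print("closest to ideal:", nearest)
```

With these made-up values, the ideal point ends up as noise (label -1) because no dataset lies within eps of it, which mirrors the caption's point that even the closest datasets remain some distance from perfect scores on all three axes.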