Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers

Kenny Peng, Arunesh Mathur, Arvind Narayanan

TL;DR

This study examines how ethical harms from face and person recognition datasets extend beyond dataset creation and across the dataset's life cycle, using LFW, MS-Celeb-1M, and DukeMTMC as case studies with nearly 1,000 citing papers. It finds that retraction does not fully mitigate harm, because derivative datasets remain in use and licenses are ambiguous, and it highlights how derivatives, pre-trained models, and post-processing can introduce new risks. The authors advocate a distributed, lifecycle-spanning approach to dataset stewardship involving creators, conferences, users, and the broader research community, with concrete recommendations on documentation, licensing, tracking, and ethics reviews. The work offers a pragmatic foundation for policy and practice to curb harms while recognizing the practical realities of data reuse and production use.

Abstract

Machine learning datasets have elicited concerns about privacy, bias, and unethical applications, leading to the retraction of prominent datasets such as DukeMTMC, MS-Celeb-1M, and Tiny Images. In response, the machine learning community has called for higher ethical standards in dataset creation. To help inform these efforts, we studied three influential but ethically problematic face and person recognition datasets -- Labeled Faces in the Wild (LFW), MS-Celeb-1M, and DukeMTMC -- by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach to harm mitigation that considers the entire life cycle of a dataset.

Paper Structure

This paper contains 54 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The use of DukeMTMC, MS-Celeb-1M, LFW, and their derivatives over time. All three datasets were commonly used through derivatives. DukeMTMC and MS-Celeb-1M were retracted in April 2019, but continued to be used in 2020, largely through derivatives.
  • Figure 2: A visualization of the rise of the production use of LFW, based on data from LFW's website. By examining versions of the website archived on the Wayback Machine, we identified (approximately) the year in which different results were added. Only 3 of 38 results added before 2014 were commercial but 41 of 49 results after 2016 were commercial.
  • Figure 3: Papers citing associated papers often do not use the associated dataset. The proportion that do varies greatly across different datasets. Here, we include associated papers for which we sampled at least 20 citing papers, and show 95 percent confidence intervals.
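
The captions for Figures 2 and 3 rest on proportions estimated from small counts, so a brief sketch of the arithmetic may help. The example below takes the two counts stated in the Figure 2 caption (3 of 38 and 41 of 49) and computes each proportion with a 95 percent interval. The Wilson score interval is an assumption made here for illustration; the captions do not say which interval method the authors used, and the function name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95 percent coverage. This method is an
    illustrative choice, not necessarily the one used in the paper.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts of commercial results taken from the Figure 2 caption.
counts = {
    "added before 2014": (3, 38),
    "added after 2016": (41, 49),
}

for label, (commercial, total) in counts.items():
    low, high = wilson_interval(commercial, total)
    share = commercial / total
    print(f"{label}: {commercial}/{total} = {share:.1%}, "
          f"95% interval [{low:.1%}, {high:.1%}]")
```

Under this assumed method the two intervals do not overlap, which is consistent with the caption's description of a rise in production use. The same function could be applied to the per-paper samples in Figure 3 (at least 20 citing papers each), where small samples lead to wide intervals.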