Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers
Kenny Peng, Arunesh Mathur, Arvind Narayanan
TL;DR
This study examines how ethical harms from face and person recognition datasets extend beyond dataset creation and across the full dataset life cycle, using LFW, MS-Celeb-1M, and DukeMTMC as case studies and analyzing nearly 1,000 papers that cite them. It finds that retractions do not fully mitigate harm, because the original data and its derivatives persist and licensing is ambiguous, and it highlights how derivatives, pre-trained models, and post-processing can introduce new risks. The authors advocate a distributed, lifecycle-spanning approach to dataset stewardship involving creators, conferences, users, and the broader research community, with concrete recommendations on documentation, licensing, tracking, and ethics reviews. The work offers a pragmatic foundation for policy and practice to curb harms while recognizing the practical realities of data reuse and production use.
Abstract
Machine learning datasets have elicited concerns about privacy, bias, and unethical applications, leading to the retraction of prominent datasets such as DukeMTMC, MS-Celeb-1M, and Tiny Images. In response, the machine learning community has called for higher ethical standards in dataset creation. To help inform these efforts, we studied three influential but ethically problematic face and person recognition datasets -- Labeled Faces in the Wild (LFW), MS-Celeb-1M, and DukeMTMC -- by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach to harm mitigation that considers the entire life cycle of a dataset.
