Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators

Will Orr, Kate Crawford

TL;DR

The paper addresses responsible dataset creation in ML, grounding its insights in qualitative interviews with 18 dataset creators across domains. It identifies seven actionable recommendations spanning data quality, diversity, documentation, openness, user-centric design, privacy and consent, and iterative development, and it highlights how creator perspectives have been underrepresented in the literature. The study emphasizes the fragmentation and undervaluation of data work, calling for professionalization and greater cross-domain collaboration to foster better practices. Overall, it argues that strengthening the ethical and technical dimensions of dataset creation is essential for responsible, robust, and accessible ML systems.

Abstract

The increasing demand for high-quality datasets in machine learning has raised concerns about the ethical and responsible creation of these datasets. Dataset creators play a crucial role in developing responsible practices, yet their perspectives and expertise have not yet been highlighted in the current literature. In this paper, we bridge this gap by presenting insights from a qualitative study that included interviewing 18 leading dataset creators about the current state of the field. We shed light on the challenges and considerations faced by dataset creators, and our findings underscore the potential for deeper collaboration, knowledge sharing, and collective development. Through a close analysis of their perspectives, we share seven central recommendations for improving responsible dataset creation, including issues such as data quality, documentation, privacy and consent, and how to mitigate potential harms from unintended use cases. By fostering critical reflection and sharing the experiences of dataset creators, we aim to promote responsible dataset creation practices and develop a nuanced understanding of this crucial but often undervalued aspect of machine learning research.

Paper Structure (42 sections, 1 figure, 1 table)