DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, Ahmed Alaa, Adji Bousso Dieng, Natasha Noy, Vijay Janapa Reddi, James Zou, Praveen Paritosh, Mihaela van der Schaar, Kurt Bollacker, Lora Aroyo, Ce Zhang, Joaquin Vanschoren, Isabelle Guyon, Peter Mattson
TL;DR
The paper introduces DMLR as a data-centric machine learning research program focused on data collection, curation, and infrastructure to advance ML science. It surveys the historical evolution of data-centricity, documents the ICML 2023 workshop as a launching point, and outlines a future ecosystem built around living datasets, data provenance, and governance. Key contributions include promoting data-centric benchmarks (e.g., DataPerf), standardized data formats (Croissant), and the concept of data trusts to enable responsible sharing. The work emphasizes societal impact and calls for ongoing community engagement to sustain datasets, infrastructure, and cross-domain adoption. Overall, it presents a practical roadmap for building a collaborative, open, and governance-aware data-centric ML landscape.
Abstract
Drawing on discussions at the inaugural DMLR workshop at ICML 2023 and at earlier meetings, this report outlines the relevance of community engagement and infrastructure development for creating next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods, with the aim of positive scientific, societal, and business impact.
