DMLR: Data-centric Machine Learning Research -- Past, Present and Future

Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, Ahmed Alaa, Adji Bousso Dieng, Natasha Noy, Vijay Janapa Reddi, James Zou, Praveen Paritosh, Mihaela van der Schaar, Kurt Bollacker, Lora Aroyo, Ce Zhang, Joaquin Vanschoren, Isabelle Guyon, Peter Mattson

TL;DR

The paper introduces DMLR as a data-centric machine learning research program focused on data collection, curation, and infrastructure to advance ML science. It surveys the historical evolution of data-centricity, documents the ICML 2023 workshop as a launching point, and outlines a future ecosystem built around living datasets, data provenance, and governance. Key contributions include promoting data-centric benchmarks (e.g., DataPerf), standardized data formats (Croissant), and the concept of data trusts to enable responsible sharing. The work emphasizes societal impact and calls for ongoing community engagement to sustain datasets, infrastructure, and cross-domain adoption. Overall, it presents a practical roadmap for building a collaborative, open, and governance-aware data-centric ML landscape.

Abstract

Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.

Paper Structure (12 sections, 3 figures)

Figures (3)

  • Figure 1: A timeline of some inflection points in the development of data-centric ideas.
  • Figure 2: Themes and contributions from the community at the DMLR ICML 2023 workshop. Top left: LDA of accepted paper abstracts with n_components = 5; the resulting 5-d topic vectors (one dimension per component) are projected to 2-d with UMAP. Each dot represents an abstract, color-coded by the most dominant topic identified by LDA. The topics identified by LDA are displayed alongside as top-20 word clouds. Bottom left: A sample of the geographic coordinates of the institutions where authors of accepted works are based. It includes only those locations for which the geocode API returned latitude and longitude for a fuzzy search on affiliation names (360 of 495 affiliations returned coordinates; note that not all 495 affiliations are unique). Right: Topics highlighted in the invited talks, including prompt-based ML development (Andrew Ng), the DMLR ecosystem (Peter Mattson), reality-centric AI (Mihaela van der Schaar), bias in vision data (Olga Russakovsky and Vikram Ramaswamy), the history of distribution shifts dating back to NeurIPS 2006 (Masashi Sugiyama), the AI research agent (Isabelle Guyon), nuances of data quality (Dina Machuve), the DMLR Journal (Ce Zhang) and data-centric LLMs (panel). Links to the full videos and slides of the talks are available in the paper's appendix.
  • Figure 3: An overview of the DMLR ecosystem pillars and community projects.