Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar; Omar Benjelloun; Costanza Conforti; Luca Foschini; Joan Giner-Miguelez; Pieter Gijsbers; Sujata Goswami; Nitisha Jain; Michalis Karamousadakis; Michael Kuchnik; Satyapriya Krishna; Sylvain Lesage; Quentin Lhoest; Pierre Marcenac; Manil Maskey; Peter Mattson; Luis Oala; Hamidah Oderinwale; Pierre Ruyssen; Tim Santos; Rajat Shinde; Elena Simperl; Arjun Suresh; Goeffry Thomas; Slava Tykhonov; Joaquin Vanschoren; Susheel Varma; Jos van der Velde; Steffen Vogler; Carole-Jean Wu; Luyao Zhang

TL;DR

Croissant is a machine-readable metadata format that reduces friction in ML data workflows by providing a structured, multi-layer description of datasets. Built on Schema.org, it comprises four layers (Dataset Metadata, Resources, Structure, and Semantic) plus the Croissant-RAI extension, so a dataset description can be loaded directly into popular ML frameworks and repositories without altering the underlying data. A user study across language, vision, audio, and multi-modal datasets finds that Croissant metadata is readable, complete, and concise. The format is already supported by major repositories such as Hugging Face Datasets, Kaggle, and OpenML. The work also delivers tooling (mlcroissant, TFDS integration, Croissant Editor) and is governed through the open Croissant Working Group, with the aim of broad adoption, semantic search, and responsible AI documentation in the ML ecosystem.
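To make the four layers concrete, the following is a minimal sketch of what a Croissant-style description might look like when built as a plain Python dictionary. It is an illustration only, not the authors' tooling and not the mlcroissant API: the dataset name, URL, and helper functions (build_metadata, field_sources, unresolved_sources) are hypothetical, and the property names only approximate the Croissant vocabulary, so the official specification should be consulted for the exact schema.

```python
import json


def build_metadata():
    """Return a hypothetical Croissant-style JSON-LD description as a dict."""
    return {
        # Prefixes are assumptions; the real context is defined by the spec.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        # Layer 1, Dataset Metadata: general information about the dataset.
        "name": "toy_dataset",
        "description": "A made-up dataset used only for illustration.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        # Layer 2, Resources: the files that hold the data.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": "https://example.org/data.csv",  # placeholder URL
                "encodingFormat": "text/csv",
            }
        ],
        # Layer 3, Structure: how records and fields map onto the resources.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "examples/label",
                        # Layer 4, Semantic: a type gives the field meaning.
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": "label"},
                        },
                    }
                ],
            }
        ],
    }


def field_sources(metadata):
    """Map each field id to the id of the resource it reads from."""
    mapping = {}
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            mapping[field["@id"]] = field["source"]["fileObject"]["@id"]
    return mapping


def unresolved_sources(metadata):
    """List field ids whose source is not declared in the Resources layer."""
    declared = {res["@id"] for res in metadata.get("distribution", [])}
    return [f for f, src in field_sources(metadata).items() if src not in declared]


if __name__ == "__main__":
    meta = build_metadata()
    print(json.dumps(meta, indent=2))
    print(field_sources(meta))
```

The description only points at data.csv through the Resources layer and never copies its contents, which is what lets the metadata be added without altering the underlying data. The unresolved_sources helper shows one kind of consistency check that a layered description makes possible.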

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.

Paper Structure (38 sections, 21 figures, 5 tables)

Introduction
Related Work
The Croissant Format
The Dataset Metadata Layer
The Resources Layer
The Structure Layer
The Semantic Layer
The Croissant-RAI Extension
Croissant Tools and Integrations
Data Repositories
ML Frameworks
The Croissant Working Group
Croissant Evaluation: A User Study with ML Practitioners
The User Study Process
Recruitment of Annotators and Annotation Process
...and 23 more sections

Figures (21)

Figure 1: The Croissant lifecycle and ecosystem.
Figure 2: Users can easily inspect datasets (e.g., Fashion MNIST) and use them in data loaders with Croissant. See Supplementary material or visit https://github.com/mlcommons/croissant for more examples.
Figure 3: Dataset metadata and resources for the PASS dataset.
Figure 4: A RecordSet that joins images and structured metadata from the PASS dataset.
Figure 5: Answers to the completeness question.
...and 16 more figures