AI Competitions and Benchmarks: Dataset Development

Romain Egele; Julio C. S. Jacques Junior; Jan N. van Rijn; Isabelle Guyon; Xavier Baró; Albert Clapés; Prasanna Balaprakash; Sergio Escalera; Thomas Moeslund; Jun Wan

TL;DR

This chapter argues that data preparation is the bottleneck in practical ML and presents a principled, agile dataset development lifecycle comprising requirements, design, implementation, evaluation, and distribution/maintenance. It provides a comprehensive framework spanning documentation standards, ethical and regulatory considerations, and detailed implementation guidance for data collection (gathering, synthesis, acquisition, annotation) and transformation (integration, cleaning, reduction, representation, normalization/calibration, augmentation). Key contributions include a structured taxonomy for dataset development activities, emphasis on transparent documentation, and guidelines for evaluating datasets with regard to soundness, completeness, fairness, and privacy. The framework aims to improve trust, robustness, and reusability of datasets in real-world AI applications, addressing challenges from bias and privacy to versioning and maintenance. Overall, it offers practical methods to design, build, and sustain high-quality datasets that underpin reliable benchmarks and AI systems.

Abstract

Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, that data is rarely ready to use; most often, it requires meticulous manual preparation. Haste in developing new models frequently results in shortcomings that pose risks once the models are deployed in real-world scenarios (e.g., social discrimination, critical failures), and that can cause AI-based projects to fail or their costs to escalate substantially. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, for developing datasets for machine learning. First, we describe the tasks involved in dataset development and offer insights into their effective management (including requirements, design, implementation, evaluation, distribution, and maintenance). Then, we provide more details about the implementation process, which includes data collection, transformation, and quality evaluation. Finally, we address practical considerations regarding dataset distribution and maintenance.

Paper Structure (28 sections, 6 figures)

This paper contains 28 sections, 6 figures.

Introduction
Documentation
Requirements
Design
De Novo Data
Reusing, Repurposing, and Recycling Data
Implementation
Data Collection
Gathering
Synthesis and Generation
Acquisition
Annotation
Data Transformation
Integration and Fusion
Cleaning
...and 13 more sections

Figures (6)

Figure 1: The dataset development cycle.
Figure 2: Categorization of sub-tasks included in data collection and transformation. Collection operators (blue) take as input a design and as output a dataset (of an arbitrary size). Transformation operators (yellow) require as input a dataset and will have as output also a dataset. Some operators (green) fall into both categories. A typical data development process combines several of these operators into a pipeline, always starting with a collection operator.
Figure 3: An example flow chart diagram of a data implementation pipeline that creates a dataset for handwritten digits classification.
Figure 4: Self-Supervised Learning through Contrastive (purple) or Non-Contrastive (orange) Learning. The input data is $x$ and the representation learned is $z$ for both.
Figure 5: Categorization of sub-processes included in data evaluation.
...and 1 more figure
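
The operator signatures described in the Figure 2 caption can be illustrated with a minimal sketch: a collection operator maps a design to a dataset, a transformation operator maps a dataset to a dataset, and a pipeline always starts with a collection operator. The type aliases (`Design`, `Dataset`) and the functions `gather`, `clean`, `rescale`, and `run_pipeline` are hypothetical names chosen for illustration only; they are not taken from the chapter, and the toy operators stand in for whatever real collection and transformation steps a project uses.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: a design is a small spec, a dataset is a list of records.
Design = Dict[str, int]
Dataset = List[Dict[str, float]]

CollectionOp = Callable[[Design], Dataset]   # design  -> dataset
TransformOp = Callable[[Dataset], Dataset]   # dataset -> dataset


def gather(design: Design) -> Dataset:
    """Toy collection operator: emit `n_samples` records from the design."""
    return [{"x": float(i)} for i in range(design["n_samples"])]


def clean(data: Dataset) -> Dataset:
    """Toy transformation operator: drop records with negative values."""
    return [record for record in data if record["x"] >= 0]


def rescale(data: Dataset) -> Dataset:
    """Toy transformation operator: divide `x` by its maximum value."""
    largest = max((record["x"] for record in data), default=0.0)
    if largest == 0:
        return [{"x": 0.0} for _ in data]
    return [{"x": record["x"] / largest} for record in data]


def run_pipeline(
    design: Design, collect: CollectionOp, transforms: List[TransformOp]
) -> Dataset:
    """Run one collection operator first, then each transformation in order."""
    data = collect(design)
    for transform in transforms:
        data = transform(data)
    return data


dataset = run_pipeline({"n_samples": 5}, gather, [clean, rescale])
```

Separating the collection step from the list of transformations in `run_pipeline` makes the "always starting with a collection operator" constraint part of the function signature rather than a convention the caller has to remember.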
