A Data-Centric Approach for Training Deep Neural Networks with Less Data

Mohammad Motamedi, Nikolay Sakharnykh, Tim Kaldewey

TL;DR

The paper presents a data-centric pipeline to train deep networks with limited data by cleaning and augmenting existing samples and by synthesizing new edge-case data with a conditional GAN. It demonstrates a 5% accuracy improvement and a 1.54× reduction in dataset size on Roman-MNIST (DCAIC) relative to a baseline, with an additional ~1% gain from GAN-generated data. The approach combines duplicate removal, human-in-the-loop validation, class balancing, and cross-validation, plus a GAN component that leverages a truncated ResNet50 classifier to produce challenging samples. Together, these strategies offer a practical pathway to higher performance under data constraints and broader accessibility for data-scarce applications.
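Two of the cleaning steps named above, duplicate removal and class balancing, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how duplicates are detected, so exact byte-level hashing is assumed here (near-duplicates would need a perceptual or embedding-based comparison), and the helper names and toy labels are hypothetical.

```python
import hashlib
from collections import Counter


def drop_exact_duplicates(samples):
    """Keep the first occurrence of each distinct image payload.

    `samples` is a list of (image_bytes, label) pairs. Assumption: two
    samples count as duplicates only if their bytes are identical.
    """
    seen = set()
    unique = []
    for image_bytes, label in samples:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((image_bytes, label))
    return unique


def class_deficits(samples):
    """Return how many samples each class lacks relative to the largest class."""
    counts = Counter(label for _, label in samples)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Toy data: hypothetical byte payloads standing in for image files.
samples = [
    (b"\x00\x01", "i"),
    (b"\x00\x01", "i"),  # exact duplicate of the first sample
    (b"\x02\x03", "i"),
    (b"\x04\x05", "ii"),
]

unique = drop_exact_duplicates(samples)
deficits = class_deficits(unique)
print(len(unique), deficits)
```

The deficit counts indicate how many additional samples each class would need, whether from augmentation, newly collected data, or a generator, to match the largest class.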

Abstract

While the availability of large datasets is perceived to be a key requirement for training deep neural networks, it is possible to train such models with relatively little data. However, compensating for the absence of large datasets demands a series of actions to enhance the quality of the existing samples and to generate new ones. This paper summarizes our winning submission to the "Data-Centric AI" competition. We discuss some of the challenges that arise while training with a small dataset, offer a principled approach for systematic data quality enhancement, and propose a GAN-based solution for synthesizing new data points. Our evaluations indicate that the dataset generated by the proposed pipeline offers a 5% accuracy improvement while being significantly smaller than the baseline.

Paper Structure

This paper contains 9 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: (a) Samples from the initial dataset. (b) Samples of augmented images used for training the auxiliary models. (c) Samples of false-positive instances detected by the auxiliary models. (d) Samples with high loss values flagged for review by the auxiliary neural networks. (e) Samples of additional handwritten data gathered for this research. (f) Additional data generated by the GAN.
  • Figure 2: Proposed GAN comprising three components: a generator, a discriminator, and a pre-trained ResNet50-based classifier that is truncated at the "conv2_block3_out" layer.