A Data-Centric Approach for Training Deep Neural Networks with Less Data
Mohammad Motamedi, Nikolay Sakharnykh, Tim Kaldewey
TL;DR
The paper presents a data-centric pipeline to train deep networks with limited data by cleaning and augmenting existing samples and by synthesizing new edge-case data with a conditional GAN. It demonstrates a 5% accuracy improvement and a 1.54× reduction in dataset size on Roman-MNIST (DCAIC) relative to a baseline, with an additional ~1% gain from GAN-generated data. The approach combines duplicate removal, human-in-the-loop validation, class balancing, and cross-validation, plus a GAN component that leverages a truncated ResNet50 classifier to produce challenging samples. Together, these strategies offer a practical pathway to higher performance under data constraints and broader accessibility for data-scarce applications.
Abstract
While the availability of large datasets is perceived to be a key requirement for training deep neural networks, it is possible to train such models with relatively little data. However, compensating for the absence of large datasets demands a series of actions to enhance the quality of the existing samples and to generate new ones. This paper summarizes our winning submission to the "Data-Centric AI" competition. We discuss some of the challenges that arise while training with a small dataset, offer a principled approach for systematic data quality enhancement, and propose a GAN-based solution for synthesizing new data points. Our evaluations indicate that the dataset generated by the proposed pipeline offers a 5% accuracy improvement while being significantly smaller than the baseline.
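To make two of the data-quality steps named in the summary more concrete (duplicate removal and class balancing), the following is a minimal sketch and not the authors' implementation. It assumes duplicates are byte-identical images, which the text does not specify, and it only reports how many samples each class is missing rather than generating them. The function names `remove_exact_duplicates` and `class_deficits` and the toy data are hypothetical.

```python
import hashlib
from collections import Counter


def remove_exact_duplicates(samples):
    """Keep the first occurrence of each distinct image payload.

    `samples` is a list of (image_bytes, label) pairs. Only byte-identical
    images count as duplicates here (an assumption of this sketch).
    """
    seen = set()
    unique = []
    for image_bytes, label in samples:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((image_bytes, label))
    return unique


def class_deficits(samples):
    """Return how many samples each class lacks relative to the largest class."""
    counts = Counter(label for _, label in samples)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Toy data: raw bytes stand in for image files; labels are Roman numerals.
toy = [(b"img-a", "i"), (b"img-a", "i"), (b"img-b", "i"), (b"img-c", "ii")]
deduped = remove_exact_duplicates(toy)
deficits = class_deficits(deduped)
```

In this toy run the repeated `b"img-a"` entry is dropped, and the deficit map indicates that class `"ii"` would need one more sample to match class `"i"`. Such a deficit could then guide augmentation or synthesis, under whatever policy the pipeline actually uses.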
