Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methods

Alhassan Mumuni, Fuseini Mumuni

TL;DR

This paper tackles the problem of manually designing data augmentation policies, which is labor-intensive and dataset-specific, by surveying AutoML-based augmentation approaches. It categorizes methods into data manipulation, data integration, and data synthesis, and analyzes three core subtasks: search-space design, policy optimization, and evaluation. The authors provide extensive quantitative comparisons showing AutoML-based augmentations generally outperform state-of-the-art classical methods on benchmarks like CIFAR-10/100 and ImageNet, while noting substantial computational demands. They also discuss practical trade-offs, the potential for combining AutoML with classical augmentations, and open research directions such as efficiency, imbalance handling, and instance-specific transformations. The work highlights AutoML as a promising direction for robust, transferable data augmentation, with future opportunities leveraging larger models and smarter search strategies to reduce cost and widen applicability.

Abstract

Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial-and-error procedures for creating and testing different candidate augmentations and their hyperparameters manually. State-of-the-art approaches are increasingly relying on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. The focus of this work is on image data augmentation methods. Nonetheless, we cover other data modalities, especially in cases where the specific data augmentation techniques being discussed are more suitable for these other modalities. For instance, since automated data integration methods are more suitable for tabular data, we cover tabular data in the discussion of data integration methods. The work also presents an extensive discussion of techniques for accomplishing each of the major subtasks of the image data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carried out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.

Paper Structure (52 sections, 1 equation, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Classical deep learning versus AutoML: In classical deep learning (A), all stages of the machine learning task (data preparation, hyperparameter selection and tuning, model selection and tweaking, and the evaluation and validation of outcomes) are performed manually. In contrast, AutoML (B) incorporates an automatic tuning mechanism to learn the best parameters and hyperparameters for all these tasks.
  • Figure 2: Bi-level optimization scheme and basic principle of operation of AutoML-based data augmentation methods. The general approach is to jointly optimize two machine learning loops: an outer loop involving augmentation hyperparameter search, and an inner loop that optimizes model parameters.
  • Figure 3: General process of data manipulation-based augmentation using AutoML pipelines. With this approach, a subset of the input data is sampled by a learned sampler for transformation by different candidate augmentation policies. Different augmentation outcomes are produced by varying the ordering and application probabilities of the transformation functions OP1, OP2, ..., OPn.
  • Figure 4: ARDA (Chepurko et al., 2020), an example of a data acquisition technique that relies on integrating data from multiple sources. In addition to performing feature engineering operations, the technique searches for optimal hyperparameters to integrate data from related tables.
  • Figure 5: A comparison of classical and automated data synthesis approaches. Classical methods of data synthesis involve three distinct steps (A): (1) creation of primitive data elements (e.g., geometric models), (2) model training, and (3) evaluation of results and fine-tuning. With AutoML approaches, however, the entire data synthesis process is carried out end-to-end in a single process. As in many AutoML procedures, the process typically utilizes a bi-level optimization scheme to optimize hyperparameters for both synthesis primitives (outer loop) and model hyperparameters (inner loop).
  • ...and 12 more figures
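
The bi-level scheme in the Figure 2 caption and the policy structure in the Figure 3 caption can be illustrated with a small sketch. The following is a toy example under stated assumptions, not a method from the paper: the operations `jitter` and `scale`, the one-parameter linear model, and the use of plain random search in the outer loop are all hypothetical stand-ins chosen for brevity. A policy is an ordered list of (operation, application probability, magnitude) triples; the outer loop samples candidate policies, and the inner loop trains model parameters on data augmented by each candidate.

```python
import random

# Toy transformation functions standing in for OP1..OPn (hypothetical names).
def jitter(x, magnitude, rng):
    return x + rng.uniform(-magnitude, magnitude)

def scale(x, magnitude, rng):
    return x * (1.0 + rng.uniform(-magnitude, magnitude))

OPS = {"jitter": jitter, "scale": scale}

def apply_policy(x, policy, rng):
    """Apply each (op, probability, magnitude) triple in order, each with its own probability."""
    for name, prob, mag in policy:
        if rng.random() < prob:
            x = OPS[name](x, mag, rng)
    return x

def train_inner(train, policy, rng, epochs=50, lr=0.05):
    """Inner loop: fit y = w * x by SGD on augmented inputs."""
    w = 0.0
    for _ in range(epochs):
        for x, y in train:
            xa = apply_policy(x, policy, rng)
            grad = 2.0 * (w * xa - y) * xa  # gradient of squared error w.r.t. w
            w -= lr * grad
    return w

def val_loss(w, val):
    """Mean squared error on clean (unaugmented) validation data."""
    return sum((w * x - y) ** 2 for x, y in val) / len(val)

def sample_policy(rng):
    """Outer-loop search space: ordering, probability and magnitude of each op."""
    names = list(OPS)
    rng.shuffle(names)
    return [(n, rng.random(), rng.uniform(0.0, 0.5)) for n in names]

def search(train, val, trials=20, seed=0):
    """Outer loop: random search over policies, scored by validation loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        policy = sample_policy(rng)
        w = train_inner(train, policy, rng)
        loss = val_loss(w, val)
        if best is None or loss < best[0]:
            best = (loss, policy, w)
    return best

if __name__ == "__main__":
    data_rng = random.Random(1)
    xs = [0.1 * i for i in range(1, 9)]
    train = [(x, 2.0 * x + data_rng.gauss(0, 0.1)) for x in xs]
    val = [(x, 2.0 * x) for x in (0.15, 0.45, 0.75)]
    loss, policy, w = search(train, val)
    print("best validation loss:", loss)
    print("best policy:", policy)
```

Random search is used here only because it is the simplest outer-loop optimizer to write down; the same structure would accept any other search strategy by replacing `sample_policy` and the selection rule in `search`.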