Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm
Joanna Komorniczak
TL;DR
This paper introduces Evolutionary Projection-based Complexity Optimization (EPCO), a projection-based genetic algorithm that transforms a synthetic, $d$-dimensional dataset via a $d \times d$ transformation matrix to reach targeted complexity profiles across classification and regression tasks. By optimizing a set of complexity metafeatures $\mathcal{C}$ toward targets $\mathcal{T}$, EPCO generates multiple datasets at five predefined difficulty levels, enabling controlled benchmarking of ML methods. Empirical results show that higher dataset complexity generally lowers classification accuracy and raises regression MAE for state-of-the-art learners, validating the link between metafeature targets and recognition performance. The approach offers a flexible, continuous, and model-agnostic means of synthetic data generation for robust evaluation and fair benchmarking in ML research.
Abstract
The research community continues to seek increasingly advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets covering a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression, 4 measures that showed promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by applying linear feature projections to synthetically created datasets until they reach the target complexity values. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and recognition quality.
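To make the core idea concrete, the following is a minimal sketch of searching for a $d \times d$ projection matrix that moves a dataset's complexity measures toward target values. It is not the EPCO implementation. The two measures (`fisher_measure`, `nn_error`) are simple stand-ins for the paper's measure set, all function names and parameters are hypothetical, and the selection step sums the per-measure gaps into one score, which is a simplification of a true many-objective selection scheme.

```python
import numpy as np

def fisher_measure(X, y):
    """Two-class Fisher-ratio-style measure: 1 / (1 + max per-feature ratio)."""
    a, b = X[y == 0], X[y == 1]
    ratio = (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12)
    return 1.0 / (1.0 + ratio.max())

def nn_error(X, y):
    """Leave-one-out 1-nearest-neighbour error rate."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample may not be its own neighbour
    return float(np.mean(y[dist.argmin(axis=1)] != y))

# Assumption: stand-in measures, not the measure set used in the paper.
MEASURES = (fisher_measure, nn_error)

def objectives(P, X, y, targets):
    """Absolute gap between each measure on the projected data and its target."""
    Xp = X @ P  # linear feature projection
    return np.array([abs(m(Xp, y) - t) for m, t in zip(MEASURES, targets)])

def evolve(X, y, targets, pop_size=20, generations=30, sigma=0.2, seed=0):
    """Elitist mutation-only search over d x d projection matrices."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Start from the identity (untransformed data) plus perturbed copies.
    pop = [np.eye(d)] + [np.eye(d) + sigma * rng.normal(size=(d, d))
                         for _ in range(pop_size - 1)]
    for _ in range(generations):
        children = [p + sigma * rng.normal(size=(d, d)) for p in pop]
        pool = pop + children
        # Simplification: rank by the summed gap instead of a
        # many-objective selection scheme.
        pool.sort(key=lambda P: objectives(P, X, y, targets).sum())
        pop = pool[:pop_size]
    return pop[0]

# Toy two-class dataset, separable along the first feature only.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3)) + y[:, None] * np.array([3.0, 0.0, 0.0])

targets = np.array([0.8, 0.3])  # hypothetical target complexity values
P = evolve(X, y, targets)
before = objectives(np.eye(3), X, y, targets).sum()
after = objectives(P, X, y, targets).sum()
print(f"summed gap to targets: {before:.3f} -> {after:.3f}")
```

Because the identity matrix is part of the initial population and the best candidates always survive, the returned projection is never further from the targets than the untransformed data. The labels `y` are left untouched; only the feature space is reshaped.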
