Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm

Joanna Komorniczak

TL;DR

This paper introduces Evolutionary Projection-based Complexity Optimization (EPCO), a projection-based genetic algorithm that transforms a synthetic, $d$-dimensional dataset via a $d \times d$ transformation matrix to reach targeted complexity profiles across classification and regression tasks. By optimizing a set of complexity metafeatures $\mathcal{C}$ toward targets $\mathcal{T}$, EPCO generates multiple datasets at five predefined difficulty levels, enabling controlled benchmarking of ML methods. Empirical results show that higher dataset complexity generally lowers classification accuracy and raises regression MAE for state-of-the-art learners, validating the link between metafeature targets and recognition performance. The approach offers a flexible, continuous, and model-agnostic means of synthetic data generation for robust evaluation and fair benchmarking in ML research.

Abstract

The research community continues to seek increasingly advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression, 4 measures that showed promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets through linear feature projections until they reach the target complexity values. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.
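The core idea of steering a complexity measure by a linear feature projection can be illustrated with a minimal sketch. It is not the EPCO algorithm: the many-objective genetic search is replaced here by a plain single-objective hill climb, only one measure is optimized (the two-class F1 measure, written in its common $1/(1 + \max_f r_f)$ form), and the synthetic data, the target value, and the names `fisher_f1`, `project`, `target`, and `P_best` are all assumptions made for illustration.

```python
import numpy as np

def fisher_f1(X, y):
    """Two-class F1 complexity, 1 / (1 + max Fisher ratio); lower means easier."""
    a, b = X[y == 0], X[y == 1]
    ratio = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))
    return 1.0 / (1.0 + ratio.max())

def project(X, P):
    """Apply a d x d linear feature projection to an n x d dataset."""
    return X @ P

rng = np.random.default_rng(0)
n, d = 500, 10

# Hypothetical synthetic problem: only feature 0 separates the two classes.
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * y

target = 0.6                      # assumed target value for F1
P_best = np.eye(d)                # start from the untransformed data
err_start = abs(fisher_f1(X, y) - target)
err_best = err_start

# Stand-in for the evolutionary search: perturb the matrix and keep
# the candidate only if it moves F1 closer to the target.
for _ in range(300):
    P = P_best + 0.1 * rng.normal(size=(d, d))
    err = abs(fisher_f1(project(X, P), y) - target)
    if err < err_best:
        P_best, err_best = P, err

X_new = project(X, P_best)
print(f"F1 before: {fisher_f1(X, y):.3f}, after: {fisher_f1(X_new, y):.3f}")
```

Because the projection mixes the informative feature with noise features, the best single-feature separation drops and F1 moves from its initial value toward the target. Extending this to several measures at once is where a many-objective search such as the one proposed in the paper becomes necessary.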

Paper Structure

This paper contains 21 sections, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Principal components of the dataset produced using random feature projections of a 10-dimensional synthetic classification problem. The original data is shown in the first plot, and the results of feature transformations in the following ones.
  • Figure 2: The order of individuals in a population. The example presents the head of a population for three criteria: $C_0$, $C_1$, and $C_2$. Individuals are sorted according to fitness for each criterion and their sum, $\Sigma_C$. Four leading individuals of the population have the highest fitness according to individual criteria and their sum.
  • Figure 3: The optimization of three complexity measures: F1, N1, and ClsCoef. The first 3 subfigures show the relationships between the criterion pairs. The individual points indicate the population's fitness at a given optimization stage. $X$ markers indicate leaders of the population. The last plot shows the leaders' scores across 200 iterations.
  • Figure 4: Classification accuracy for datasets transformed with EPCO towards various difficulty levels. Higher data complexity resulted in lower accuracy scores for all evaluated classifiers.
  • Figure 5: Mean absolute error obtained by the evaluated regression methods for datasets transformed with EPCO towards target difficulties. More challenging problems are associated with higher recognition errors.
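
The population ordering described in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: higher fitness is taken to be better, the leaders are the best individual for each criterion plus the best by the sum $\Sigma_C$, and the remaining individuals are ordered by descending sum. The last rule, the function name `order_population`, and the fitness values are hypothetical and not taken from the paper.

```python
import numpy as np

def order_population(fitness):
    """Return (order, leaders) for an n_individuals x n_criteria fitness matrix.

    Leaders: the best individual per criterion, then the best by the sum of
    criteria. The rest follow by descending sum (an assumed tie-free rule).
    """
    fitness = np.asarray(fitness, dtype=float)
    totals = fitness.sum(axis=1)
    candidates = list(fitness.argmax(axis=0)) + [totals.argmax()]
    # Drop duplicates while keeping order, in case one individual leads twice.
    leaders = list(dict.fromkeys(int(i) for i in candidates))
    rest = [int(i) for i in np.argsort(-totals) if int(i) not in leaders]
    return leaders + rest, leaders

# Hypothetical fitness values for three criteria C_0, C_1, C_2.
fitness = [
    [0.3, 0.2, 0.1],
    [0.9, 0.1, 0.2],   # best on C_0
    [0.6, 0.6, 0.6],   # best on the sum
    [0.2, 0.8, 0.1],   # best on C_1
    [0.1, 0.3, 0.7],   # best on C_2
]
order, leaders = order_population(fitness)
print(order)    # [1, 3, 4, 2, 0]
print(leaders)  # [1, 3, 4, 2]
```

With three criteria this yields four leaders at the head of the population, matching the count given in the caption.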