SubStrat: A Subset-Based Strategy for Faster AutoML

Teddy Lazebnik; Amit Somech; Abraham Itzhak Weinberg

TL;DR

SubStrat tackles AutoML's high computational cost on large datasets by reducing the data size rather than the configuration space. It uses a genetic algorithm to find a measure-preserving data subset (DST): a small subset of the data that preserves a chosen dataset measure of the full dataset. It first runs the target AutoML tool on the DST to obtain an intermediate configuration $M'$, then fine-tunes that configuration with a constrained AutoML pass on the full dataset to produce $M_{sub}$. Experiments with Auto-Sklearn and TPOT across 10 datasets show substantial time reductions (about $79\%$ on average) with minimal accuracy loss (relative accuracy well above $95\%$ on most tasks). Compared to baselines such as random DSTs and standard row/column sampling, SubStrat consistently achieves higher speedups while preserving predictive performance, highlighting the practical value of data-size reduction in AutoML.

Abstract

Automated machine learning (AutoML) frameworks have become important tools in the data scientist's arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines (typically containing feature engineering, model selection, and hyperparameter tuning steps) and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, so the overall AutoML running time becomes increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size rather than the configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset that preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally refines the resulting pipeline by executing a restricted, much shorter AutoML process on the large dataset. Our experimental results, performed on two popular AutoML frameworks, Auto-Sklearn and TPOT, show that SubStrat reduces their running times by 79% (on average), with less than 2% average loss in the accuracy of the resulting ML pipeline.
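The subset-search step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the measure used here (mean per-column Shannon entropy), the function names `column_entropy`, `dataset_entropy`, and `find_dst`, and the specific crossover and mutation operators are all assumptions made for illustration. The sketch only covers the first stage (finding the subset); running an AutoML tool on the subset and fine-tuning on the full data depend on the wrapped tool and are not shown.

```python
import math
import random
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of one column's empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def dataset_entropy(rows, row_idx, col_idx):
    """Assumed dataset measure: mean column entropy over the chosen rows/columns."""
    total = sum(column_entropy([rows[r][c] for r in row_idx]) for c in col_idx)
    return total / len(col_idx)


def find_dst(rows, n_rows, n_cols, generations=30, pop_size=20, seed=0):
    """Genetic search for n_rows rows and n_cols columns whose measure is
    close to that of the full dataset. Returns (candidate, loss, target)."""
    rng = random.Random(seed)
    all_rows = range(len(rows))
    all_cols = range(len(rows[0]))
    target = dataset_entropy(rows, all_rows, all_cols)

    def loss(cand):
        # Distance between the subset's measure and the full dataset's measure.
        return abs(dataset_entropy(rows, cand[0], cand[1]) - target)

    def random_candidate():
        return (rng.sample(all_rows, n_rows), rng.sample(all_cols, n_cols))

    def crossover(a, b):
        # Hypothetical operator: resample from the union of the parents' indices.
        pool_r = sorted(set(a[0]) | set(b[0]))
        pool_c = sorted(set(a[1]) | set(b[1]))
        return (rng.sample(pool_r, n_rows), rng.sample(pool_c, n_cols))

    def mutate(cand):
        # Hypothetical operator: swap one selected row or column for an unused one.
        r, c = list(cand[0]), list(cand[1])
        if rng.random() < 0.5:
            unused = [i for i in all_rows if i not in r]
            if unused:
                r[rng.randrange(n_rows)] = rng.choice(unused)
        else:
            unused = [j for j in all_cols if j not in c]
            if unused:
                c[rng.randrange(n_cols)] = rng.choice(unused)
        return (r, c)

    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=loss)
        elite = population[: pop_size // 2]  # keep the better half unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = min(population, key=loss)
    return best, loss(best), target


# Usage on synthetic categorical data (200 rows, 6 columns).
demo_rng = random.Random(42)
demo_rows = [[demo_rng.randint(0, c + 1) for c in range(6)] for _ in range(200)]
(demo_r, demo_c), demo_loss, demo_target = find_dst(demo_rows, n_rows=20, n_cols=3)
print(f"rows kept: {len(demo_r)}, columns kept: {sorted(demo_c)}")
print(f"target measure: {demo_target:.3f}, subset loss: {demo_loss:.3f}")
```

Because the better half of each generation is carried over unchanged, the best loss never gets worse from one generation to the next. In a full pipeline, the returned row and column indices would define the small dataset handed to the AutoML tool.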

Paper Structure (15 sections, 13 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Problem & Solution Overview
Related Work
Solution Architecture
Measure-Preserving Data Subsets
DST as an Optimization Problem
A Genetic-Based Algorithm for Finding DST
Fine-Tuning the Intermediate Configuration
Experiments
Setup & Methodology
Baseline Methods
Overall Baseline Comparison Results
Time-Reduction & Accuracy Trade-off
Effect of DST Size (Length and Width)
Conclusion & Future Work

Figures (5)

Figure 1: SubStrat Workflow
Figure 2: Per-Dataset Performance
Figure 3: SubStrat Settings Skyline
Figure 4: Overall Effect of DST Size
Figure 5: Isolated Effect of DST Length and Width

Theorems & Definitions (3)

Definition 1: Data Subset (DST)
Definition 2: Measure-Preserving DST
Definition 3: Dataset Entropy
