Generating Skyline Datasets for Data Science Models

Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu

TL;DR

The paper studies how to generate datasets that optimize several ML performance criteria at once, rather than a single objective. It proposes MODis, a skyline data generation framework formalized as a multi-goal finite-state transducer with augment and reduct operators. It shows that skyline data generation is NP-hard but fixed-parameter tractable under the restrictions stated in Theorem 1, and offers three algorithms (ApxMODis, BiMODis, and DivMODis) with quantified guarantees. Experiments on real datasets show that the MODis variants typically outperform single-objective baselines on trade-offs among accuracy, efficiency, and training cost, while DivMODis diversifies the skyline set to mitigate bias. The work provides a principled, scalable approach to multi-objective data discovery in data science pipelines, with potential extensions to distributed and query-optimized skyline data generation.

Abstract

Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single pre-defined quality measure, which may lead to bias in downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all the performance measures. We formulate MODis as a multi-goal finite state transducer, and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms, and showcase their applications in optimizing data science pipelines.
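To make the notion of a skyline concrete, the following minimal sketch filters candidate datasets by Pareto dominance. It assumes every performance measure has been converted so that smaller is better. The dataset names, the measure values, and the helpers `dominates` and `skyline` are hypothetical illustrations, not taken from the paper, and the brute-force filter is not one of the MODis algorithms.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every measure and strictly better on one.

    Assumption: all measures are costs, so smaller values are better.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(perf: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    result = []
    for name, vec in perf.items():
        dominated = any(
            dominates(other_vec, vec)
            for other_name, other_vec in perf.items()
            if other_name != name
        )
        if not dominated:
            result.append(name)
    return result


# Hypothetical candidates: (model error, training cost in seconds).
candidates = {
    "D1": (0.10, 30.0),
    "D2": (0.12, 12.0),
    "D3": (0.15, 40.0),  # worse than D1 on both measures
    "D4": (0.20, 5.0),
}

print(skyline(candidates))
```

Here `D3` drops out because `D1` is better on both measures, while `D1`, `D2`, and `D4` each represent a different trade-off and all stay in the skyline.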

Paper Structure

This paper contains 16 sections, 6 theorems, 8 equations, 15 figures, 6 tables, 4 algorithms.

Key Result

Theorem 1

Skyline data generation is (1) $\mathsf{NP}$-hard; and (2) fixed-parameter tractable, if (a) ${\mathcal{P}}$ is fixed, and (b) $|{\mathcal{D}}_F|$ is polynomially bounded by the input size $|{\mathcal{D}}|$.

Figures (15)

  • Figure 1: Data generation for CI index prediction addressing multiple user-defined ML performance criteria, in order to improve an input ML model.
  • Figure 2: A skyline data generation process, with a part of running graphs, and result datasets.
  • Figure 3: $\mathsf{ApxMODis}$
  • Figure 4: "Reduct-from-Universal": an illustration of two-level computation. It performs multiple level-wise spawns and updates the $\epsilon$-Skyline set.
  • Figure 5: $\mathsf{BiMODis}$
  • ...and 10 more figures
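
The caption of Figure 4 mentions an $\epsilon$-Skyline set without defining it here. One common reading, borrowed from the $\epsilon$-Pareto literature and an assumption in this sketch, is a small set in which every candidate is matched within a factor $(1+\epsilon)$ on all measures by some kept member. The sketch below maintains such a set incrementally, assuming strictly positive measures where smaller is better; the names `eps_covers` and `eps_skyline` and the sample values are hypothetical, and this is not the paper's update procedure.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def eps_covers(q: Sequence[float], p: Sequence[float], eps: float) -> bool:
    """True if q is within a (1 + eps) factor of p on every measure."""
    return all(x <= (1.0 + eps) * y for x, y in zip(q, p))


def dominates(q: Sequence[float], p: Sequence[float]) -> bool:
    """Exact Pareto dominance for cost measures (smaller is better)."""
    return all(x <= y for x, y in zip(q, p)) and any(x < y for x, y in zip(q, p))


def eps_skyline(points: Sequence[Vector], eps: float) -> List[Vector]:
    """Scan the points once and keep a set that eps-covers all of them."""
    kept: List[Vector] = []
    for p in points:
        # Skip p if an already kept vector is nearly as good on all measures.
        if any(eps_covers(q, p, eps) for q in kept):
            continue
        # Otherwise p enters the set and evicts members it exactly dominates.
        kept = [q for q in kept if not dominates(p, q)]
        kept.append(p)
    return kept


# Hypothetical (error, cost) vectors.
points = [(1.0, 10.0), (1.05, 10.2), (5.0, 2.0), (0.9, 9.5)]
approx = eps_skyline(points, eps=0.1)
print(approx)
```

Evicting only exactly dominated members keeps the coverage property intact: any vector that was covered by an evicted member is also covered by the vector that evicted it.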

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5