Generating Skyline Datasets for Data Science Models
Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu
TL;DR
The paper addresses the problem of generating datasets that optimize multiple ML performance criteria simultaneously, rather than a single objective. It proposes MODis, a skyline data generation framework formalized as a multi-goal finite-state transducer with augment and reduct operators. The paper proves that skyline data generation is NP-hard, gives fixed-parameter tractable results for the multi-objective setting, and presents three approximation algorithms (ApxMODis, BiMODis, and DivMODis) with quantified guarantees. Empirical evaluation on real datasets shows that the MODis variants typically outperform single-objective baselines in the trade-offs among accuracy, efficiency, and training cost, while DivMODis increases diversity to mitigate bias. The work provides a principled, scalable approach to multi-objective data discovery in data science pipelines, with potential extensions to distributed and query-optimized skyline data generation.
Abstract
Preparing the high-quality datasets required by data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets toward a single predefined quality measure, which may lead to bias in downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all of the performance measures. We formulate MODis as a multi-goal finite-state transducer and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms and showcase their applications in optimizing data science pipelines.
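To make the notion of a skyline over multiple performance measures concrete, the following is a minimal sketch of a generic Pareto-dominance filter. It is not one of the MODis algorithms; it only illustrates which candidates would count as non-dominated once each candidate dataset has been scored. The candidate names (D1 to D4), the two measures (error and training cost), their values, and the convention that lower is better on every measure are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Perf = Tuple[float, ...]  # one value per performance measure; lower is better (assumption)


def dominates(a: Perf, b: Perf) -> bool:
    """Return True if a is no worse than b on every measure and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Perf]) -> List[str]:
    """Return the names of candidates that no other candidate dominates, in input order."""
    return [
        name
        for name, perf in candidates.items()
        if not any(
            dominates(other_perf, perf)
            for other_name, other_perf in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical scores: (model error, training cost in seconds).
candidates = {
    "D1": (0.10, 30.0),
    "D2": (0.12, 12.0),
    "D3": (0.15, 40.0),  # worse than D1 on both measures
    "D4": (0.10, 35.0),  # same error as D1 but higher cost
}

print(skyline(candidates))  # ['D1', 'D2']
```

D1 and D2 both remain because neither is better on both measures: D1 has the lower error, D2 the lower training cost. This pairwise filter is quadratic in the number of candidates and assumes every candidate has already been scored, so it says nothing about how candidate datasets are generated or how the model is evaluated on them.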
