Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans

Stefan Grafberger

TL;DR

Problem: ML pipelines couple data integration, preprocessing, and modeling, making correctness and fairness highly sensitive to data handling. Approach: propose an abstract, database-inspired representation of ML pipelines as logical query plans and develop tools to extract plans from native Python code, instrument execution with provenance (mlinspect), and perform data-centric what-if analyses (mlwhatif) with multi-query optimization. Contributions: formal DAG-based pipeline model; provenance-enabled lightweight inspection; pipeline rewriting for what-if analyses; open-source tooling and interactive directions. Significance: enables automated end-to-end validation, monitoring, and analysis of ML pipelines without manual instrumentation, accelerating reliable deployment in practice.

Abstract

Machine Learning (ML) is increasingly used to automate impactful decisions, which raises concerns about the correctness, reliability, and fairness of these decisions. We envision highly automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code that relies on popular libraries. Based on these plans, we automatically infer pipeline semantics, then instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables lightweight screening for common data preparation issues. Finally, we built machinery to automatically rewrite ML pipelines to perform more advanced what-if analyses and proposed using multi-query optimisation for the resulting workloads. In future work, we aim to interactively assist data scientists as they work on their ML pipelines.
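To make the idea of provenance tracking over relational-style operators more concrete, here is a minimal pure-Python sketch. It is not the mlinspect implementation or its API. The `Row` class, the `source`, `select`, `join`, and `group_ratios` helpers, and the toy patient data are all hypothetical names and values chosen for illustration. Each row carries the set of source-row identifiers it derives from, and a simple check compares group proportions before and after a filter, which is one plausible form of the "lightweight screening" mentioned above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Row:
    values: dict           # column name -> value
    provenance: frozenset  # ids of the source rows this row derives from


def source(name, records):
    """Wrap raw records, tagging each with a (source name, index) id."""
    return [Row(dict(r), frozenset({(name, i)})) for i, r in enumerate(records)]


def select(rows, predicate):
    """Selection: surviving rows keep their provenance unchanged."""
    return [r for r in rows if predicate(r.values)]


def join(left, right, key):
    """Inner equi-join: output provenance is the union of both inputs."""
    index = {}
    for r in right:
        index.setdefault(r.values[key], []).append(r)
    return [
        Row({**l.values, **r.values}, l.provenance | r.provenance)
        for l in left
        for r in index.get(l.values[key], [])
    ]


def group_ratios(rows, column):
    """Fraction of rows per distinct value of `column`."""
    counts = Counter(r.values[column] for r in rows)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


# Hypothetical toy data, not taken from the paper.
patients = source("patients", [
    {"id": 1, "age_group": "young", "county": "A"},
    {"id": 2, "age_group": "young", "county": "A"},
    {"id": 3, "age_group": "old", "county": "B"},
    {"id": 4, "age_group": "old", "county": "B"},
])
histories = source("histories", [
    {"id": 1, "complications": 0},
    {"id": 2, "complications": 1},
    {"id": 3, "complications": 2},
    {"id": 4, "complications": 0},
])

joined = join(patients, histories, "id")
filtered = select(joined, lambda v: v["county"] == "A")

before = group_ratios(joined, "age_group")
after = group_ratios(filtered, "age_group")

# Flag groups whose share shifts by more than an (arbitrary) threshold.
shifted = {
    group: (share, after.get(group, 0.0))
    for group, share in before.items()
    if abs(share - after.get(group, 0.0)) > 0.25
}
```

In this toy run, the county filter removes every row of one age group, so `shifted` reports both groups. Because each output row still records which source rows produced it, a check like this could in principle be attached to any operator in the plan without changing the pipeline code itself.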

Paper Structure (7 sections, 2 figures)

Figures (2)

  • Figure 1: ML pipelines in the real world often join data from multiple data sources, clean and integrate the data, define feature encoding pipelines, and use techniques like data augmentation before finally passing the featurised data to ML models. The model training and evaluation, which is typically the focus of ML research, is only a small part of the process.
  • Figure 2: Example of an ML pipeline in healthcare that predicts which patients are at a higher risk of serious medical complications. The pipeline is implemented using native constructs from the popular pandas and scikit-learn libraries. On the left, we show the source code of the pipeline. On the right, we show the corresponding dataflow graph extracted by our methods. (Operations on the test set and for estimator/transformer fitting are omitted for readability.)
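
The mapping from source code to a dataflow graph described for Figure 2 can be illustrated with a small sketch. It uses a simplified static def-use analysis over Python's standard `ast` module as a stand-in technique and makes no claim about how the author's extraction machinery works. The pipeline snippet, its file names, and its column names are hypothetical, and the snippet is only parsed, never executed. `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Hypothetical pipeline source; it is parsed only, so pandas is not needed.
PIPELINE_SOURCE = '''
patients = pd.read_csv("patients.csv")
histories = pd.read_csv("histories.csv")
data = patients.merge(histories, on="ssn")
data = data[data["county"].isin(counties_of_interest)]
model = pipeline.fit(data)
'''


def extract_dataflow(source):
    """Build a DAG with one node per top-level assignment.

    An edge (u, v) means node v reads a variable last defined by node u.
    """
    tree = ast.parse(source)
    last_def = {}  # variable name -> id of the node that last defined it
    nodes, edges = [], []
    for stmt in tree.body:
        is_simple_assign = (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        )
        if not is_simple_assign:
            continue  # this sketch ignores everything else
        node_id = len(nodes)
        target = stmt.targets[0].id
        nodes.append((node_id, target, ast.unparse(stmt.value)))
        # Names read on the right-hand side of the assignment.
        used = {
            n.id
            for n in ast.walk(stmt.value)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
        }
        for name in sorted(used):
            if name in last_def:
                edges.append((last_def[name], node_id))
        last_def[target] = node_id
    return nodes, edges


nodes, edges = extract_dataflow(PIPELINE_SOURCE)
for node_id, target, expression in nodes:
    print(node_id, target, "<-", expression)
print(sorted(edges))
```

A def-use graph like this only records which statement feeds which. Turning each node into a logical operator (join, selection, projection, estimator fit) would additionally require knowledge of the library calls involved, which this sketch deliberately leaves out.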