Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans
Stefan Grafberger
TL;DR
Problem: ML pipelines couple data integration, preprocessing, and modeling, making correctness and fairness highly sensitive to data handling.
Approach: propose an abstract, database-inspired representation of ML pipelines as logical query plans and develop tools to extract plans from native Python code, instrument execution with provenance (mlinspect), and perform data-centric what-if analyses (mlwhatif) with multi-query optimization.
Contributions: formal DAG-based pipeline model; provenance-enabled lightweight inspection; pipeline rewriting for what-if analyses; open-source tooling and interactive directions.
Significance: enables automated end-to-end validation, monitoring, and analysis of ML pipelines without manual instrumentation, accelerating reliable deployment in practice.
Abstract
Machine Learning (ML) is increasingly used to automate impactful decisions, which raises concerns about the correctness, reliability, and fairness of these decisions. We envision highly automated software platforms that assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code that relies on popular libraries. Based on these plans, we automatically infer pipeline semantics, then instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables lightweight screening for common data preparation issues. Finally, we built machinery to automatically rewrite ML pipelines to perform more advanced what-if analyses and proposed using multi-query optimisation for the resulting workloads. In future work, we aim to interactively assist data scientists as they work on their ML pipelines.
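To make the idea of a logical query plan with provenance more concrete, the following is a minimal sketch in plain Python. It is not the mlinspect or mlwhatif API. The names PlanNode, source, selection, and group_ratios, as well as the toy patients data, are hypothetical and introduced only for illustration. The sketch assumes that each pipeline step becomes an operator node in a DAG and that every row carries the identifiers of the source rows it was derived from, so that a simple check can compare group proportions before and after a filter.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One operator in a toy logical plan (hypothetical structure)."""
    op: str                                      # e.g. "source" or "selection"
    description: str
    inputs: list = field(default_factory=list)   # parent PlanNode objects


def source(rows, name):
    """Create a source operator and tag each row with a provenance id."""
    node = PlanNode("source", name)
    tagged = [(dict(row), {(name, i)}) for i, row in enumerate(rows)]
    return node, tagged


def selection(parent, tagged, predicate, description):
    """Create a selection operator; surviving rows keep their provenance."""
    node = PlanNode("selection", description, [parent])
    kept = [(row, prov) for row, prov in tagged if predicate(row)]
    return node, kept


def group_ratios(tagged, column):
    """Share of rows per value of `column` in an operator's output."""
    counts = Counter(row[column] for row, _ in tagged)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Toy input data (made up for this example).
patients = [
    {"age": 25, "group": "a"},
    {"age": 40, "group": "a"},
    {"age": 52, "group": "b"},
    {"age": 61, "group": "b"},
]

src_node, src_rows = source(patients, "patients")
sel_node, sel_rows = selection(
    src_node, src_rows, lambda r: r["age"] > 30, "age > 30"
)

# Compare group proportions before and after the filter.
before = group_ratios(src_rows, "group")
after = group_ratios(sel_rows, "group")
print(before, after)
```

In this toy run the filter changes the balance between the two groups, which is the kind of data preparation issue that screening based on an instrumented plan could flag. The provenance sets also show which source rows survived the filter.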
