
Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Stefan Grafberger, Paul Groth, Sebastian Schelter

TL;DR

The paper targets the tedious, iterative debugging of ML data preparation pipelines. It proposes shadow pipelines: hidden variants of the original pipeline that automatically detect potential issues, try out fixes, and suggest them to the user with provenance explanations and impact estimates. Incremental view maintenance keeps the computation and maintenance of the shadow pipelines at low latency. The authors formalize the problem, outline the shadow-pipeline architecture, and report preliminary experiments in which the proposed optimisations cut shadow-pipeline runtime by up to a factor of 38 and keep incremental updates under one second in all but one scenario. The approach could speed up reliable pipeline development and reduce manual rewriting; future work includes broader integration with LLM-based code suggestions.

Abstract

Data scientists develop ML pipelines iteratively: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines: hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision applying incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.
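To make the idea concrete, the following is a minimal sketch of a shadow pipeline maintained incrementally. It is not the authors' implementation: the names `RowCountView`, `drop_missing`, `impute_missing`, and `suggest`, the toy `age` column, and the placeholder value 40 are all assumptions made for illustration. The original pipeline drops rows with a missing value, the shadow variant imputes them instead, and both keep a small materialised view that is updated from newly arrived rows only.

```python
from dataclasses import dataclass


@dataclass
class RowCountView:
    """A materialised view holding the number of rows a pipeline variant outputs."""
    count: int = 0

    def apply_delta(self, delta_rows, prepare):
        # Incremental maintenance: run the variant's preparation step on the
        # new rows only and fold the result into the stored count.
        self.count += sum(1 for row in delta_rows if prepare(row) is not None)


def drop_missing(row):
    """Original pipeline step (hypothetical): discard rows with a missing age."""
    return row if row["age"] is not None else None


def impute_missing(row, default_age=40):
    """Shadow variant (hypothetical): keep the row and fill in a placeholder age."""
    return row if row["age"] is not None else {**row, "age": default_age}


def suggest(original, shadow):
    """Report the estimated impact of switching to the shadow variant."""
    gained = shadow.count - original.count
    if gained > 0:
        return f"Imputing instead of dropping would retain {gained} more rows."
    return None


# Initial run: both variants process the full input once.
rows = [{"age": 31}, {"age": None}, {"age": 52}, {"age": None}]
original_view, shadow_view = RowCountView(), RowCountView()
original_view.apply_delta(rows, drop_missing)
shadow_view.apply_delta(rows, impute_missing)
first_suggestion = suggest(original_view, shadow_view)

# New data arrives: both views are updated from the delta alone,
# without reprocessing the rows seen earlier.
new_rows = [{"age": None}]
original_view.apply_delta(new_rows, drop_missing)
shadow_view.apply_delta(new_rows, impute_missing)
second_suggestion = suggest(original_view, shadow_view)
```

A row count is the simplest view that can be maintained this way. A realistic system would need to maintain richer state, such as intermediate relational results and model quality metrics, and would also have to handle code changes and deletions rather than only appended rows.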

Paper Structure (5 sections, 3 figures)

This paper contains 5 sections and 3 figures.

Figures (3)

  • Figure 1: Our vision -- several automatically maintained "shadow pipelines" give actionable suggestions on how to improve a user's ML pipeline code at development time.
  • Figure 2: Benefits of our proposed optimisations for computing shadow pipelines. Our shadow pipeline optimisations decrease the runtime by up to a factor of 38.
  • Figure 3: Benefits of our proposed optimisations for incrementally updating the pipelines. The runtime (shown on a log scale) is less than one second in all but one scenario.