LeakageDetector: An Open Source Data Leakage Analysis Tool in Machine Learning Pipelines

Eman Abdullah AlOmar, Catherine DeMario, Roger Shagawat, Brandon Kreiser

TL;DR

The paper addresses Data Leakage in ML pipelines by introducing LeakageDetector, a PyCharm IDE plugin that bundles a Dockerized leakage analysis tool to detect Overlap, Multi-test, and Preprocessing leakage directly in Python code. It extends prior static analysis work with IDE integration, plain-language leakage explanations, and skeleton quick fixes to improve usability and adoptability among practitioners. A preliminary evaluation with 8 participants and 31 analyzed files shows varying user satisfaction and a notable prevalence of Preprocessing leakage, suggesting practical impact and areas for refinement. Overall, LeakageDetector aims to lower barriers to leakage-aware coding, facilitating maintenance and promoting best practices in real-world ML development.
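To make two of the three leakage categories concrete, here is a minimal runtime sketch. It assumes the usual meaning of the terms in the leakage literature (Overlap: the same rows appear in both training and test data; Multi-test: the test data is consulted more than once), since the summary above does not define them. The names `overlapping_rows` and `SingleUseTestSet` are hypothetical and are not part of LeakageDetector, which works by static analysis rather than runtime checks like these.

```python
def overlapping_rows(train_rows, test_rows):
    """Return rows present in both splits (a symptom of Overlap leakage)."""
    return sorted(set(train_rows) & set(test_rows))


class SingleUseTestSet:
    """Hypothetical guard that hands out the test split only once."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._used = False

    def take(self):
        # A second request suggests the test data is steering model choices,
        # which is the pattern described here as Multi-test leakage.
        if self._used:
            raise RuntimeError("test set already used once")
        self._used = True
        return self._rows


# Toy rows: (feature, label) pairs; (3, "c") was duplicated across splits.
train_rows = [(1, "a"), (2, "b"), (3, "c")]
test_rows = [(3, "c"), (4, "d")]

shared = overlapping_rows(train_rows, test_rows)

guard = SingleUseTestSet(test_rows)
first_use = guard.take()
try:
    guard.take()
    second_use_blocked = False
except RuntimeError:
    second_use_blocked = True
```

In this toy run, `shared` holds the one duplicated row and the second call to `take()` is refused.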

Abstract

Code quality is of paramount importance in all types of software development settings. Our work seeks to enable Machine Learning (ML) engineers to write better code by helping them find and fix instances of Data Leakage in their models. Data Leakage often results from bad practices in writing ML code. As a result, the model effectively "memorizes" the data on which it trains, leading to an overly optimistic estimate of the model performance and an inability to make generalized predictions. ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code. Training data should be used to train the model, evaluation data should be used to repeatedly confirm a model's accuracy, and test data should be used only once to determine the accuracy of a production-ready model. In this paper, we develop LeakageDetector, a Python plugin for the PyCharm IDE that identifies instances of Data Leakage in ML code and provides suggestions on how to remove the leakage.
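The split discipline described in the abstract can be sketched with the standard library alone. The example below is illustrative only and is not taken from the paper: the helper names (`three_way_split`, `fit_scaler`, `transform`), the 60/20/20 proportions, and the toy data are all assumptions. It contrasts a leaky preprocessing step, where scaling statistics are computed over every row, with a leak-free one, where they come from the training split alone.

```python
import random
import statistics


def three_way_split(rows, seed=0, train_frac=0.6, eval_frac=0.2):
    """Shuffle once, then cut into disjoint train / evaluation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_eval = round(len(shuffled) * eval_frac)
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test


def fit_scaler(values):
    """Learn standardization parameters from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std


def transform(values, scaler):
    """Apply previously learned parameters without refitting."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


data = [float(x) for x in range(100)]  # toy feature column
train, evaluation, test = three_way_split(data)

# Leaky: the statistics include evaluation and test rows, so information
# about held-out data flows into the features the model trains on.
leaky_scaler = fit_scaler(data)

# Leak-free: fit on the training split, then reuse the same parameters
# for the evaluation and test splits.
clean_scaler = fit_scaler(train)
train_scaled = transform(train, clean_scaler)
eval_scaled = transform(evaluation, clean_scaler)
test_scaled = transform(test, clean_scaler)
```

The only difference between the two variants is which rows `fit_scaler` sees. That makes the pattern easy to overlook in review.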

Paper Structure

This paper contains 12 sections and 4 figures.

Figures (4)

  • Figure 1: LeakageDetector in action, showing the identified Data Leakage instances.
  • Figure 2: High-level architecture of LeakageDetector.
  • Figure 3: Distribution of Data Leakage types selected by participants.
  • Figure 4: Participants’ satisfaction with various aspects of the LeakageDetector tool.