Abstract Interpretation-Based Data Leakage Static Analysis
Filip Drobnjaković, Pavle Subotić, Caterina Urban
TL;DR
The paper addresses data leakage in ML models by introducing a static analysis, grounded in abstract interpretation, that proves the absence of leakage by checking that training and testing data originate from disjoint inputs. It builds a formal semantic pipeline: starting from a concrete trace semantics, it derives a sound and computable abstract semantics via a dependency abstraction and a data-leakage semantics, and then implements the analysis in the NBLyzer framework. The empirical evaluation on over 2000 Kaggle notebooks reports 25 real data-leakage cases with a precision of 93%, a roughly 60% improvement in detection over previous ad-hoc methods, and a modest slowdown of about 7%. The work contributes a rigorous semantic foundation for data-leakage analysis in data-manipulating programs and shows practical utility for detecting leakage early in notebook-based workflows.
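The disjointness condition can be illustrated with a toy dependency tracker. The sketch below is not the paper's abstract domain or NBLyzer's implementation. It assumes a hypothetical row-level model in which every dataset row records the set of input rows it may depend on, and the helper names (`read_input`, `select_rows`, `normalize`, `may_leak`) are invented for illustration.

```python
# Toy model (an assumption, not the paper's domain): a dataset is a dict
# mapping each of its row indices to the set of input rows it may depend on.

def read_input(n_rows):
    """Each freshly read row depends only on itself."""
    return {i: frozenset({i}) for i in range(n_rows)}

def select_rows(data, rows):
    """Row selection keeps the dependencies of the selected rows only."""
    return {i: data[i] for i in rows}

def normalize(data):
    """Normalization uses statistics of the whole dataset, so every
    output row depends on every source row of its argument."""
    all_sources = frozenset().union(*data.values())
    return {i: all_sources for i in data}

def sources(data):
    """All input rows that a dataset may depend on."""
    return frozenset().union(*data.values())

def may_leak(train, test):
    """Report possible leakage when train and test share an input row."""
    return bool(sources(train) & sources(test))

raw = read_input(6)

# Normalizing before splitting taints both halves with all six rows.
normalized = normalize(raw)
leaky_train = select_rows(normalized, range(0, 4))
leaky_test = select_rows(normalized, range(4, 6))

# Splitting first keeps the two dependency sets disjoint.
safe_train = normalize(select_rows(raw, range(0, 4)))
safe_test = normalize(select_rows(raw, range(4, 6)))

print(may_leak(leaky_train, leaky_test))  # True
print(may_leak(safe_train, safe_test))    # False
```

This sketch tracks dependencies over concrete row indices for readability. A static analysis has to over-approximate the same information without running the program.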
Abstract
Data leakage is a well-known problem in machine learning. Data leakage occurs when information from outside the training dataset is used to create a model. This phenomenon renders a model excessively optimistic or even useless in the real world, since the model tends to rely heavily on the unfairly acquired information. To date, detection of data leakage occurs post-mortem using run-time methods. However, due to the insidious nature of data leakage, it may not be apparent to a data scientist that data leakage has occurred in the first place. For this reason, it is advantageous to detect data leakage as early as possible in the development life cycle. In this paper, we propose a novel static analysis to detect several instances of data leakage at development time. We define our analysis using the framework of abstract interpretation: we define a concrete semantics that is sound and complete, from which we derive a sound and computable abstract semantics. We implement our static analysis inside the open-source NBLyzer static analysis framework and demonstrate its utility by evaluating its performance and precision on over 2000 Kaggle competition notebooks.
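As a concrete illustration of information from outside the training dataset reaching a model, the following minimal sketch centers training features using a mean computed over all rows, test rows included. The data values and the helper `center` are hypothetical and chosen only to make the effect visible; they are not taken from the paper's evaluation.

```python
from statistics import mean

def center(values, reference):
    """Subtract the mean of `reference` from every element of `values`."""
    m = mean(reference)
    return [v - m for v in values]

# Hypothetical one-feature dataset: the first four rows are training data.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = data[:4], data[4:]

# Leaky: the mean is computed over all rows, so the test rows
# influence the features that the model is trained on.
leaky_train = center(train, data)

# Clean: only the training rows contribute to the statistic.
clean_train = center(train, train)

print(clean_train)                 # [-1.5, -0.5, 0.5, 1.5]
print(leaky_train == clean_train)  # False
```

Nothing fails at run time in the leaky variant, which is why such mistakes are easy to miss without a dedicated check.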
