Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries
João Eduardo Montandon, Luciana Lourdes Silva, Cristiano Politowski, Daniel Prates, Arthur de Brito Bonifácio, Ghizlane El Boussaidi
TL;DR
This work introduces Default Argument Breaking Changes (DABCs), a semantic form of API breakage that arises when a library changes the default value of a parameter. By examining Scikit-Learn, NumPy, and Pandas, the authors identify 93 DABCs across eight major versions (Scikit-Learn) and across minor/patch releases (NumPy, Pandas), and study their potential impact on more than 500K client applications, with Scikit-Learn clients the most exposed (about 35%). The study combines manual inspection of versionchanged notes, a large-scale analysis of client notebooks, and a static matching pipeline to quantify how often clients rely on defaults and how vulnerable they are to DABCs. An extended analysis investigates why maintainers introduce DABCs (e.g., Maintainability, API Compatibility, Bug Fixing) and what effects DABCs have on clients (Aesthetics, Behavior, Performance, Refactoring), and offers practical guidance for researchers, library maintainers, and users on mitigating these risks. The work emphasizes the ripple effects of DABCs, advocates for more principled versioning, and provides a replication package to support further investigation and tooling for detecting and managing such changes in data-science libraries.
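To make the idea of "relying on a default" concrete, the following is a minimal sketch of a static check that flags calls which leave a given parameter unset. It is not the authors' matching pipeline; the helper name `calls_relying_on_default` and the sample client code are assumptions made for illustration. The example parameter is `n_estimators` of `RandomForestClassifier`, whose default scikit-learn documents as having changed from 10 to 100 in version 0.22. The client code is only parsed, never executed, so scikit-learn does not need to be installed.

```python
import ast

# Hypothetical client script (illustrative only; it is parsed, not run).
CLIENT_CODE = """
from sklearn.ensemble import RandomForestClassifier
a = RandomForestClassifier()
b = RandomForestClassifier(n_estimators=50)
c = RandomForestClassifier(max_depth=3)
"""


def calls_relying_on_default(source, func_name, param, position=None):
    """Return sorted line numbers of calls to `func_name` that leave `param` unset.

    `position` is the zero-based positional index of `param`, if it can be
    passed positionally. Calls using **kwargs are treated as "possibly set"
    and skipped; *args unpacking is not analysed in this sketch.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Support both `f(...)` and `module.f(...)` call forms.
        if isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = getattr(callee, "id", None)
        if name != func_name:
            continue
        set_by_keyword = any(kw.arg == param for kw in node.keywords)
        has_kwargs_unpack = any(kw.arg is None for kw in node.keywords)
        set_by_position = position is not None and len(node.args) > position
        if not (set_by_keyword or set_by_position or has_kwargs_unpack):
            hits.append(node.lineno)
    return sorted(hits)


exposed = calls_relying_on_default(
    CLIENT_CODE, "RandomForestClassifier", "n_estimators", position=0
)
print(exposed)  # line numbers of calls exposed to a change in this default
```

A name-based match like this cannot tell apart two different functions that share a name, so a real analysis would also need to resolve imports. The sketch only shows why calls such as `RandomForestClassifier()` are exposed while `RandomForestClassifier(n_estimators=50)` is not.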
Abstract
Data Science (DS) has become a cornerstone for modern software, enabling data-driven decisions to improve companies' services. Following modern software development practices, data scientists use third-party libraries to support their tasks. As the APIs provided by these tools often require an extensive list of arguments to be set up, data scientists rely on default values to simplify their usage. It turns out that these default values can change over time, leading to a specific type of breaking change, defined as Default Argument Breaking Change (DABC). This work reveals 93 DABCs in three Python libraries frequently used in Data Science tasks -- Scikit Learn, NumPy, and Pandas -- and studies their potential impact on more than 500K client applications. We find that the occurrence of DABCs varies significantly depending on the library: 35% of Scikit Learn clients are affected, while only 0.13% of NumPy clients are impacted. The main reason for introducing DABCs is to enhance API maintainability, but they often change the function's behavior. We discuss the importance of managing DABCs in third-party DS libraries and provide insights for developers to mitigate the potential impact of these changes in their applications.
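As an illustration of the definition above, the sketch below models two releases of a purely hypothetical library function, `scale_v1` and `scale_v2`, that differ only in the default of `normalize`. None of these names come from the paper or from any real library. The helper `changed_defaults` compares the two signatures with the standard `inspect` module, and the calls at the end show how client code that omits the argument silently changes behavior, while code that sets it explicitly does not.

```python
import inspect


def scale_v1(values, normalize=False):
    """Hypothetical library function as shipped in an older release."""
    total = sum(values)
    return [v / total for v in values] if normalize else list(values)


def scale_v2(values, normalize=True):
    """Same function in a newer release: only the default value changed."""
    total = sum(values)
    return [v / total for v in values] if normalize else list(values)


def changed_defaults(old, new):
    """Map parameter name -> (old default, new default) where the default differs."""
    old_params = inspect.signature(old).parameters
    new_params = inspect.signature(new).parameters
    empty = inspect.Parameter.empty
    changes = {}
    for name, old_p in old_params.items():
        new_p = new_params.get(name)
        # Only compare parameters that carry a default in both versions.
        if new_p is None or old_p.default is empty or new_p.default is empty:
            continue
        if new_p.default != old_p.default:
            changes[name] = (old_p.default, new_p.default)
    return changes


print(changed_defaults(scale_v1, scale_v2))

data = [1, 3]
# Client relying on the default: the result differs after the upgrade.
print(scale_v1(data), scale_v2(data))
# Client pinning the argument: the result is the same in both versions.
print(scale_v1(data, normalize=False), scale_v2(data, normalize=False))
```

Setting the argument explicitly is one simple way for a client to insulate itself from this kind of change, at the cost of more verbose calls.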
