A Systematic Survey on Debugging Techniques for Machine Learning Systems
Thanh-Dat Nguyen, Haoye Tian, Bach Le, Patanamon Thongtanunam, Shane McIntosh
TL;DR
This systematic survey catalogs ML debugging techniques and maps them to a taxonomy of real faults to show how well research addresses practitioners' needs. It constructs a two-tier taxonomy (fault types and debugging methods) via open/validation coding of 96 papers, and extends Humbatova et al.'s fault taxonomy with newly targeted or emerging faults. The study finds that roughly half of the identified debugging challenges are addressed in the literature, while a majority of real-world issues reported on GitHub and raised in practitioner interviews remain untargeted, underscoring a significant gap between research and practice. It concludes with concrete implications for researchers and framework developers, emphasizing data processing, interpretability, test quality, data bias, framework usability, and standardization as priority areas for advancing ML debugging in real-world deployments.
Abstract
Debugging ML software (i.e., detecting, localizing, and fixing faults) poses unique challenges compared to traditional software, largely due to the probabilistic nature and heterogeneity of its development process. Various methods have been proposed for testing, diagnosing, and repairing ML systems. However, a big picture that points to research directions addressing developers' most pressing needs has yet to emerge, leaving several key questions unanswered: (1) Which faults targeted by ML debugging research meet developers' needs in practice? (2) How are these faults addressed? (3) What are the challenges in addressing the faults that remain untargeted? In this paper, we conduct a systematic study of debugging techniques for machine learning systems. We first collect technical papers that focus on debugging components of machine learning software. We then map these papers to a taxonomy of faults to assess how far the existing literature has progressed in resolving them. Next, we analyze which techniques the collected papers use to address specific faults. This results in a comprehensive taxonomy that aligns faults with their corresponding debugging methods. Finally, we examine previously released transcripts of developer interviews to identify the challenges in resolving unfixed faults. Our analysis reveals that only 48 percent of the identified ML debugging challenges have been explicitly addressed by researchers, while 46.9 percent remain unresolved or unmentioned. In real-world applications, we found that 52.6 percent of issues reported on GitHub and 70.3 percent of problems discussed in interviews are still unaddressed by ML debugging research. The study identifies 13 primary challenges in ML debugging, highlighting a significant gap between the identification of ML debugging issues and their resolution in practice.
