The Hidden Bloat in Machine Learning Systems
Huaifeng Zhang, Ahmed Ali-Eldin
TL;DR
The paper addresses software bloat in ML systems by introducing Negativa-ML, a tool that debloats both CPU (host) and GPU (device) code inside ML shared libraries. Its approach has two components: a kernel detector that identifies the GPU kernels a workload actually uses, with low overhead, and a kernel locator that finds and retains the cubins and elements those kernels need. A compaction stage then removes the rest. Evaluated on four ML frameworks across ten workloads and over 300 shared libraries, Negativa-ML achieves total file-size reductions of up to 55%, with GPU code reductions of up to 75% and CPU code reductions of up to 72%. It also reduces peak host memory usage, peak GPU memory usage, and execution time by up to 74.6%, 69.6%, and 44.6%, respectively. The results show that GPU code is a major source of bloat in ML frameworks and that a small subset of libraries accounts for a disproportionate share of the reductions, suggesting practical benefits for deployment in resource-constrained environments and edge data centers.
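The detect, locate, and compact flow described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Negativa-ML's implementation: the `Element` class, the helper names `locate_used` and `reduction`, and all kernel names and sizes are hypothetical, and the real tool works on binary device code embedded in shared libraries rather than on Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for one device-code element (e.g. a cubin)."""
    name: str
    size: int              # size in bytes (invented for illustration)
    kernels: frozenset     # names of the kernels defined in this element


def locate_used(elements, used_kernels):
    """Keep every element that defines at least one kernel seen at run time."""
    return [e for e in elements if e.kernels & used_kernels]


def reduction(elements, kept):
    """Fraction of device-code bytes removed by dropping unused elements."""
    total = sum(e.size for e in elements)
    return 1 - sum(e.size for e in kept) / total if total else 0.0


# Toy library contents: names and sizes are assumptions, not measured data.
elements = [
    Element("sm80_gemm", 400, frozenset({"gemm_f16", "gemm_f32"})),
    Element("sm80_conv", 300, frozenset({"conv2d_fwd"})),
    Element("sm80_misc", 300, frozenset({"softmax", "layernorm"})),
]

# What a kernel detector might record while running one workload.
used = {"gemm_f16", "softmax"}

kept = locate_used(elements, used)
print([e.name for e in kept], reduction(elements, kept))
```

In this toy model an element is retained whole if any of its kernels is used, so the achievable reduction depends on how kernels are grouped into elements as well as on how many kernels the workload calls.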
Abstract
Software bloat refers to code and features that are not used by software at runtime. For Machine Learning (ML) systems, bloat is a major contributor to technical debt, leading to decreased performance and wasted resources. In this work, we present Negativa-ML, a novel tool that identifies and removes bloat in ML frameworks by analyzing their shared libraries. Our approach includes novel techniques to detect and locate unnecessary code within device code, a key area overlooked by existing research, which focuses primarily on host code. We evaluate Negativa-ML using four popular ML frameworks across ten workloads and over 300 shared libraries. The results demonstrate that the ML frameworks are highly bloated in both device and host code. On average, Negativa-ML reduces the device code size in these frameworks by up to 75% and the host code size by up to 72%, resulting in total file size reductions of up to 55%. The device code is a primary source of bloat within ML frameworks. Through debloating, we achieve reductions in peak host memory usage, peak GPU memory usage, and execution time of up to 74.6%, 69.6%, and 44.6%, respectively.
