A Broad Comparative Evaluation of Software Debloating Tools

Michael D. Brown, Adam Meily, Brian Fairservice, Akshay Sood, Jonathan Dorn, Eric Kilmer, Ronald Eytchison

TL;DR

This work provides a holistic examination of software debloating by first surveying a decade of literature to establish a taxonomy of debloating techniques and metrics, then performing a broad empirical evaluation of ten debloating tools across twenty benchmarks with twelve metrics. The study reveals substantial gaps in tool maturity and soundness, with only limited improvements in performance or security for most debloated programs. A novel differential fuzzing tool, DIFFER, exposed frequent correctness and robustness issues, underscoring the risk of premature adoption without rigorous post-debloating validation. Overall, the paper highlights the need for more robust, versatile debloating methods and standardized evaluation practices to make debloating viable for real-world software.

Abstract

Software debloating tools seek to improve program security and performance by removing unnecessary code, called bloat. While many techniques have been proposed, several barriers to their adoption have emerged. Namely, debloating tools are highly specialized, making it difficult for adopters to find the right type of tool for their needs. This is further hindered by a lack of established metrics and comparative evaluations between tools. To close this information gap, we surveyed 10 years of debloating literature and several tools currently under commercial development to taxonomize knowledge about the debloating ecosystem. We then conducted a broad comparative evaluation of 10 debloating tools to determine their relative strengths and weaknesses. Our evaluation, conducted on a diverse set of 20 benchmark programs, measures tools across 12 performance, security, and correctness metrics. Our evaluation surfaces several concerning findings that contradict the prevailing narrative in the debloating literature. First, debloating tools lack the maturity required to be used on real-world software, evidenced by a slim 22% overall success rate for creating passable debloated versions of medium- and high-complexity benchmarks. Second, debloating tools struggle to produce sound and robust programs. Using our novel differential fuzzing tool, DIFFER, we discovered that only 13% of our debloating attempts produced a sound and robust debloated program. Finally, our results indicate that debloating tools typically do not improve the performance or security posture of debloated programs by a significant degree according to our evaluation metrics. We believe that our contributions in this paper will help potential adopters better understand the landscape of tools and will motivate future research and development of more capable debloating tools.

Paper Structure (28 sections, 8 tables)