FuzzDistill: Intelligent Fuzzing Target Selection using Compile-Time Analysis and Machine Learning
Saket Upadhyay
TL;DR
FuzzDistill tackles fuzzing inefficiency by combining compile-time analysis with machine learning to prioritize high-risk code regions before runtime execution. It introduces an LLVM-based compile-time feature extractor (FuzzDistillCC) and two ML approaches (XGBoost and a Deep Neural Network) that perform binary classification of vulnerability likelihood, trained on Juliet C/C++ datasets. A Flask front-end (FuzzDistillWeb) serves predictions via APIs and includes a cache to accelerate repeated queries, supported by a reproducible data pipeline. Results show XGBoost achieving up to 86.31% accuracy with an AUC-ROC of 0.955 and strong class-wise performance, demonstrating that static program structure signals can meaningfully improve directed fuzzing efficiency.
Abstract
Fuzz testing is a fundamental technique for identifying vulnerabilities in software systems. However, the process can be slow and resource-intensive, especially on large codebases. In this work, I present FuzzDistill, an approach that uses compile-time data and machine learning to refine fuzzing targets. By analyzing compile-time information, such as function call graph features, loop information, and memory operations, FuzzDistill identifies high-priority areas of the codebase that are more likely to contain vulnerabilities. I evaluate the approach through experiments on real-world software, which show substantial reductions in testing time.
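To make the prioritization idea concrete, the following is a minimal sketch of ranking functions by predicted vulnerability likelihood, with a cache for repeated queries. It is not the FuzzDistill implementation: the feature layout (call-graph out-degree, loop count, memory operations), the weights, the bias, the function names, and the hand-written logistic scorer are all hypothetical stand-ins for a trained classifier such as XGBoost.

```python
import math
from functools import lru_cache

# Hypothetical feature order: (call-graph out-degree, loop count, memory operations).
# Illustrative weights only; a real model would be trained on labeled data.
WEIGHTS = (0.3, 0.5, 0.8)
BIAS = -4.0


@lru_cache(maxsize=1024)
def predict_risk(features):
    """Stand-in classifier: logistic score in (0, 1) for a feature tuple.

    lru_cache memoizes results so repeated queries for the same
    feature vector skip recomputation.
    """
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def rank_targets(functions, threshold=0.5):
    """Return (name, score) pairs at or above threshold, highest score first."""
    scored = [(name, predict_risk(tuple(feats))) for name, feats in functions.items()]
    flagged = [(name, score) for name, score in scored if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-function feature vectors.
functions = {
    "parse_header": (4, 3, 6),
    "log_message": (1, 0, 1),
    "copy_buffer": (2, 1, 9),
}

ranking = rank_targets(functions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A fuzzer could then spend its budget on the functions at the top of the ranking first. The binary decision corresponds to the threshold, while the score ordering gives a priority among the flagged functions.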
