FuzzDistill: Intelligent Fuzzing Target Selection using Compile-Time Analysis and Machine Learning

Saket Upadhyay

TL;DR

FuzzDistill addresses fuzzing inefficiency by combining compile-time analysis with machine learning to prioritize high-risk code regions before any runtime execution. It introduces an LLVM-based compile-time feature extractor (FuzzDistillCC) and two ML models, XGBoost and a deep neural network, that perform binary classification of vulnerability likelihood and are trained on the Juliet C/C++ datasets. A Flask front-end (FuzzDistillWeb) serves predictions through APIs and caches results to speed up repeated queries; a reproducible data pipeline supports the whole workflow. XGBoost achieves up to 86.31% accuracy with an AUC-ROC of 0.955 and strong class-wise performance, indicating that static program-structure signals can meaningfully improve directed fuzzing efficiency.
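The caching and prioritization ideas summarized above can be illustrated with a minimal sketch. It is not the FuzzDistillWeb implementation: the class `CachedPredictor`, the helper `rank_targets`, the feature names `loop_count` and `mem_ops`, and the `toy_score` function are all hypothetical stand-ins, and `toy_score` merely takes the place of a trained classifier. The sketch assumes each code unit is described by a dictionary of numeric compile-time features and that repeated queries carry identical feature values.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

# A code unit is described by named numeric features (hypothetical names).
Features = Dict[str, float]


def feature_key(features: Features) -> str:
    """Build a stable cache key: SHA-256 of the features with sorted keys."""
    payload = json.dumps(features, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class CachedPredictor:
    """Wrap any scoring function and memoize its result per feature vector."""

    def __init__(self, score_fn: Callable[[Features], float]) -> None:
        self._score_fn = score_fn
        self._cache: Dict[str, float] = {}
        self.hits = 0
        self.misses = 0

    def predict(self, features: Features) -> float:
        key = feature_key(features)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # First time this feature vector is seen: score it and remember it.
        self.misses += 1
        score = self._score_fn(features)
        self._cache[key] = score
        return score


def rank_targets(
    predictor: CachedPredictor, units: Dict[str, Features]
) -> List[Tuple[str, float]]:
    """Return (unit name, score) pairs, highest predicted risk first."""
    scored = [(name, predictor.predict(feats)) for name, feats in units.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(features: Features) -> float:
    """Stand-in for a trained classifier; NOT the model from the paper."""
    raw = 0.1 * features.get("loop_count", 0) + 0.2 * features.get("mem_ops", 0)
    return min(1.0, raw)


if __name__ == "__main__":
    predictor = CachedPredictor(toy_score)
    units = {
        "parse_header": {"loop_count": 2, "mem_ops": 3},
        "log_message": {"loop_count": 0, "mem_ops": 1},
    }
    for name, score in rank_targets(predictor, units):
        print(f"{name}: {score:.2f}")
    rank_targets(predictor, units)  # second pass is served from the cache
    print(f"hits={predictor.hits} misses={predictor.misses}")
```

Hashing the serialized features with sorted keys makes the cache key independent of dictionary ordering, so two requests with the same feature values map to the same entry. In a real service the in-memory dictionary would likely need a size bound or an eviction policy.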

Abstract

Fuzz testing is a fundamental technique for identifying vulnerabilities in software systems. However, it can be slow and resource-intensive, especially on large codebases. In this work, I present FuzzDistill, an approach that uses compile-time data and machine learning to refine fuzzing targets. By analyzing compile-time information such as function call graph features, loop information, and memory operations, FuzzDistill identifies high-priority areas of the codebase that are more likely to contain vulnerabilities. Experiments on real-world software demonstrate the efficacy of my approach, showing substantial reductions in testing time.

Paper Structure

This paper contains 21 sections, 5 equations, 14 figures.

Figures (14)

  • Figure 1: Workflow of FuzzDistill
  • Figure 2: Structure of Module in LLVM
  • Figure 3: XGBoost Confusion Matrix
  • Figure 4: XGBoost Precision-Recall curve
  • Figure 5: XGBoost ROC curve
  • ...and 9 more figures