Comment on Revisiting Neural Program Smoothing for Fuzzing

Dongdong She, Kexin Pei, Junfeng Yang, Baishakhi Ray, Suman Jana

TL;DR

The paper tackles the problem of unreliable reevaluation of ML-based fuzzers by dissecting MLFuzz's critique of NEUZZ. It identifies critical flaws in four areas: persistent-mode initialization, a Python API crash, training-data collection, and result reporting. It then provides a corrected implementation and evaluation on the Google FuzzBench dataset. After fixing these issues and running 24-hour persistent-mode experiments, the authors show that NEUZZ retains its reported performance advantage over AFL and achieves throughput comparable to AFL, countering MLFuzz's claims. The work also outlines concrete guidelines for fair, reproducible fuzzing evaluations and emphasizes robust debugging and diverse benchmarks to avoid misleading conclusions when revisiting prior fuzzing results.

Abstract

MLFuzz, a work accepted at ACM FSE 2023, revisits the performance of a machine learning-based fuzzer, NEUZZ. We demonstrate that its main conclusion is entirely wrong due to several fatal bugs in the implementation and wrong evaluation setups, including an initialization bug in persistent mode, a program crash, an error in training dataset collection, and a mistake in fuzzing result collection. Additionally, MLFuzz uses noisy training datasets without sufficient data cleaning and preprocessing, which contributes to a drastic performance drop in NEUZZ. We address these issues and provide a corrected implementation and evaluation setup, showing that NEUZZ consistently performs well over AFL on the FuzzBench dataset. Finally, we reflect on the evaluation methods used in MLFuzz and offer practical advice on fair and scientific fuzzing evaluations.

Paper Structure (14 sections, 1 figure, 2 tables)