Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models
Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen
TL;DR
This work tackles the difficulty of fuzzing large DL libraries with thousands of APIs by introducing DFUZZ, a white-box, LLM-driven fuzzing framework. DFUZZ runs a three-stage pipeline: edge case extraction, in which the LLM reasons over low-level API checks to infer transferable edge cases; initial program generation, which synthesizes test programs for the target APIs; and edge case-based mutation, which applies the extracted edge cases to those programs to stress the APIs efficiently. Empirically, DFUZZ achieves higher API coverage than state-of-the-art LLM-based fuzzers and discovers 37 bugs across PyTorch and TensorFlow, including many not found by prior tools, with 8 already fixed and 19 under developer investigation. Crucially, many edge cases transfer across APIs with the same input types and even across frameworks. The practical impact is a scalable, automated approach to DL library QA that reduces LLM usage while improving bug discovery, with a public artifact to support reproducibility and further research.
Abstract
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like buffer overflows and use-after-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries such as TensorFlow and PyTorch, which feature over 1,000 APIs. Testing all these APIs is challenging due to complex inputs and varied usage patterns. While large language models (LLMs) show promise in code understanding and generation, existing LLM-based fuzzers lack deep knowledge of API edge cases and struggle with test input generation. To address this, we propose DFUZZ, an LLM-driven fuzzing approach for DL libraries. DFUZZ leverages two insights: (1) LLMs can reason about error-triggering edge cases from API code and apply this knowledge to untested APIs, and (2) LLMs can accurately synthesize test programs to automate API testing. By providing LLMs with a "white-box view" of APIs, DFUZZ enhances reasoning and generation for comprehensive fuzzing. Experimental results show that DFUZZ outperforms state-of-the-art fuzzers in API coverage for TensorFlow and PyTorch, uncovering 37 bugs, with 8 fixed and 19 under developer investigation.
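The idea of reusing edge cases across APIs that share an input type can be illustrated with a small sketch. This is not DFUZZ's implementation: the edge-case pool, the helper names (`EDGE_CASES`, `mutate_args`, `fuzz_api`), and the toy function `toy_chunk` are all hypothetical, and the pool is hard-coded here, whereas DFUZZ has an LLM infer edge cases from API source code. The sketch also stands in Python exceptions for the memory-safety crashes a real harness would look for, typically by running each test program in a separate process.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical edge-case pool keyed by input type. In DFUZZ the edge cases are
# inferred by an LLM from API checks; here they are hard-coded for illustration.
EDGE_CASES: Dict[str, List[Any]] = {
    "int": [0, -1, 2**31 - 1, 2**63],
    "list": [[], [0], [-1], [2**31]],
}


def mutate_args(seed_args: Sequence[Any], arg_types: Sequence[str]) -> Iterator[Tuple[Any, ...]]:
    """Yield variants of seed_args, replacing one argument at a time
    with an edge case registered for that argument's type."""
    for i, type_name in enumerate(arg_types):
        for edge in EDGE_CASES.get(type_name, []):
            variant = list(seed_args)
            variant[i] = edge
            yield tuple(variant)


def fuzz_api(api: Callable[..., Any], seed_args: Sequence[Any],
             arg_types: Sequence[str]) -> List[Tuple[Tuple[Any, ...], str]]:
    """Call api on every variant and record inputs that fail in a way
    the API's own validation did not anticipate."""
    findings = []
    for args in mutate_args(seed_args, arg_types):
        try:
            api(*args)
        except (ValueError, TypeError):
            pass  # graceful rejection: the API validated its input
        except Exception as exc:  # anything else counts as a finding in this toy setup
            findings.append((args, type(exc).__name__))
    return findings


def toy_chunk(data: list, size: int) -> int:
    """Toy stand-in for a library API: it checks for negative sizes
    but forgets to check for zero."""
    if size < 0:
        raise ValueError("size must be non-negative")
    return len(data) // size


findings = fuzz_api(toy_chunk, ([1, 2, 3, 4], 2), ("list", "int"))
print(findings)
```

Because the pool is keyed by type rather than by API, any other function that takes an `int` or `list` argument can be passed to `fuzz_api` without new edge cases being written, which is the transfer property the abstract describes in miniature.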
