
May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs

Shaoyu Yang, Chunrong Fang, Haifeng Lin, Xiang Chen, Jia Liu, Zhenyu Chen

TL;DR

FUEL introduces a two-agent, feedback-driven fuzzing framework for DL frameworks: an analysis LLM distills rich execution feedback into concise guidance, and a generation LLM uses that guidance to produce diverse, valid tests. It adds a feedback-aware simulated annealing-based operator selection (FASA) to broaden test coverage and a program self-repair module to fix invalid tests, all guided by differential testing across eager and compiler backends. Against four state-of-the-art baselines, the approach improves line coverage on PyTorch and TensorFlow by 4.48% and 9.14%, respectively. It also uncovers 104 previously unknown bugs, with 93 confirmed as new, 53 already fixed, and 14 assigned CVE IDs. The results show promising practical impact for automated DL framework testing, while also highlighting trade-offs in compute. FUEL’s framework and artifacts offer a scalable blueprint for feedback-driven fuzzing in other large software systems as LLM capabilities continue to advance.
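The loop described above can be illustrated with a minimal sketch. It is not FUEL's implementation: the two LLMs are stood in for by plain callables, the backends are ordinary functions returning a float, and the names `Feedback`, `differential_check`, and `fuzz_loop` are hypothetical. Treating every exception as plain "exception" feedback is also a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feedback:
    kind: str    # "exception", "bug", or "coverage" (labels assumed for this sketch)
    detail: str


def differential_check(test_input, eager: Callable, compiled: Callable,
                       tol: float = 1e-6) -> Feedback:
    """Run one input on both backends and record what happened."""
    try:
        expected = eager(test_input)
        actual = compiled(test_input)
    except Exception as exc:
        # An invalid test or a crash: report the exception text as feedback.
        return Feedback("exception", f"{type(exc).__name__}: {exc}")
    if abs(expected - actual) > tol:
        # The two backends disagree: a candidate inconsistency bug.
        return Feedback("bug", f"eager={expected!r} compiled={actual!r}")
    return Feedback("coverage", "outputs agree")


def fuzz_loop(generate: Callable[[str], object],
              analyze: Callable[[Feedback], str],
              eager: Callable, compiled: Callable,
              iterations: int) -> List[Feedback]:
    """Alternate generation and analysis, feeding each summary forward."""
    summary = ""  # no guidance exists before the first test runs
    history: List[Feedback] = []
    for _ in range(iterations):
        test_input = generate(summary)   # stand-in for the generation LLM
        feedback = differential_check(test_input, eager, compiled)
        summary = analyze(feedback)      # stand-in for the analysis LLM
        history.append(feedback)
    return history
```

In this sketch the summary returned by `analyze` is the only channel from one iteration to the next, which mirrors the idea of condensing raw feedback before it reaches the generator.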

Abstract

Deep Learning (DL) frameworks have served as fundamental components of DL systems over the last decade. However, bugs in DL frameworks can lead to catastrophic consequences in critical scenarios. A simple yet effective way to find such bugs is fuzz testing (fuzzing). Existing approaches focus on test generation and leave execution results with high semantic value (e.g., coverage information, bug reports, and exception logs) unused, even though these results can serve as multiple types of feedback. To fill this gap, we propose FUEL, which uses this feedback effectively through two Large Language Models (LLMs): an analysis LLM and a generation LLM. Specifically, the analysis LLM infers analysis summaries from the feedback, while the generation LLM creates tests guided by these summaries. Furthermore, building on this multi-feedback guidance, we design two additional components: (i) a feedback-aware simulated annealing algorithm that selects operators for test generation, enriching test diversity; and (ii) a program self-repair strategy that automatically repairs invalid tests, enhancing test validity. We evaluate FUEL on the two most popular DL frameworks, and experimental results show that FUEL improves line coverage of PyTorch and TensorFlow by 4.48% and 9.14%, respectively, over four state-of-the-art baselines. By the time of submission, FUEL has detected 104 previously unknown bugs in PyTorch and TensorFlow, with 93 confirmed as new bugs and 53 already fixed. Fourteen vulnerabilities have been assigned CVE IDs, 7 of which are rated high-severity with a CVSS score of "7.5 HIGH". Our artifact is available at https://github.com/NJU-iSE/FUEL
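To make the operator-selection idea concrete, here is a generic simulated annealing sketch. It is not the paper's feedback-aware algorithm: the `score` function (for example, how much new coverage an operator has produced so far), the geometric cooling schedule, and the names `select_operator` and `anneal` are all assumptions made for illustration.

```python
import math
import random


def select_operator(current, candidates, score, temperature, rng):
    """One annealing step: propose an operator and accept or reject it."""
    proposal = rng.choice(candidates)
    delta = score(proposal) - score(current)
    # Always accept an improvement; accept a worse operator with a
    # probability that shrinks as the temperature drops.
    if delta >= 0 or rng.random() < math.exp(delta / temperature):
        return proposal
    return current


def anneal(candidates, score, steps, t0=1.0, cooling=0.9, seed=0):
    """Return the sequence of operators chosen over `steps` iterations."""
    rng = random.Random(seed)
    current = rng.choice(candidates)
    temperature = t0
    trace = [current]
    for _ in range(steps):
        current = select_operator(current, candidates, score, temperature, rng)
        trace.append(current)
        temperature *= cooling  # geometric cooling, an assumed schedule
    return trace
```

At high temperature the selection explores low-scoring operators, which favors diversity; as the temperature falls it settles on operators the feedback rates highly.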

Paper Structure

This paper contains 33 sections, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Motivating example illustrating Limitations #1 and #2
  • Figure 2: Motivating example illustrating Limitation #3
  • Figure 3: Motivating example illustrating Limitation #4
  • Figure 4: Overview of FUEL
  • Figure 5: Default prompt template of FUEL
  • ...and 6 more figures