Beyond Random Inputs: A Novel ML-Based Hardware Fuzzing

Mohamadreza Rostami, Marco Chilese, Shaza Zeitouni, Rahul Kande, Jeyavijayan Rajendran, Ahmad-Reza Sadeghi

TL;DR

This paper tackles the challenge of comprehensively testing complex hardware, where traditional methods struggle to scale and to cover intricate components. It introduces ChatFuzz, a three-step ML-based hardware fuzzer that uses large language models and reinforcement learning, guided by RTL/ISA coverage signals, to generate interdependent instruction sequences. Empirical results on the RISC-V RocketCore and BOOM processors show ChatFuzz reaching high condition coverage far faster than a state-of-the-art fuzzer (e.g., 74.96% in under an hour on RocketCore and 97.02% on BOOM) and uncovering more than 10 unique mismatches, including two new bugs in RocketCore and discrepancies from the RISC-V ISA Simulator. The approach demonstrates a fast, scalable path to deeper hardware vulnerability discovery and provides generalizable techniques for coverage-driven input synthesis in processor fuzzing.

Abstract

Modern computing systems rely heavily on hardware as the root of trust. However, their increasing complexity has given rise to security-critical vulnerabilities that cross-layer attacks can exploit. Traditional hardware vulnerability detection methods, such as random regression and formal verification, have limitations. Random regression, while scalable, is slow in exploring hardware, and formal verification techniques often suffer from manual effort and state explosion. Hardware fuzzing has emerged as an effective approach to exploring and detecting security vulnerabilities in large-scale designs like modern processors. Hardware fuzzers outperform traditional methods in coverage, scalability, and efficiency. However, state-of-the-art fuzzers struggle to achieve comprehensive coverage of intricate hardware designs within a practical timeframe, often falling short of a 70% coverage threshold. We propose a novel ML-based hardware fuzzer, ChatFuzz, to address this challenge. Our approach leverages LLMs like ChatGPT to understand processor language, focusing on machine codes and generating assembly code sequences. RL is integrated to guide the input generation process by rewarding the inputs using code coverage metrics. We use the open-source RISC-V-based RocketCore processor as our testbed. ChatFuzz achieves a condition coverage rate of 75% in just 52 minutes, whereas a state-of-the-art fuzzer requires a lengthy 30-hour window to reach similar condition coverage. Furthermore, our fuzzer can attain 80% coverage when provided with a limited pool of 10 simulation instances/licenses within a 130-hour window. During this time, it ran a total of 199K test cases, of which 6K produced discrepancies with the processor's golden model. Our analysis identified more than 10 unique mismatches, including two new bugs in the RocketCore and discrepancies from the RISC-V ISA Simulator.

Paper Structure (30 sections, 1 equation, 2 figures)


Figures (2)

  • Figure 1: ChatFuzz's final model results from three consecutive training steps: (1) unsupervised training based on the GPT-2 model to learn the inner structure of the machine language; (2) PPO-based RL training that uses a disassembler as a scoring agent to refine the initial model, cleaning up the learned language and removing bad combinations of instructions; (3) a further PPO-based RL process that improves coverage, in which the refined generator is trained with a reward function based on coverage information obtained through RTL simulation.
  • Figure 2: Coverage analysis of TheHuzz and ChatFuzz over time for RocketCore.
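
The two reward signals described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the paper's implementation: the exact reward definitions are not given in this summary, so both functions and their names (`validity_reward`, `coverage_reward`) are hypothetical, and the disassembler and RTL simulator are assumed to have already produced the inputs shown.

```python
from typing import Iterable, Set


def validity_reward(decoded_ok: Iterable[bool]) -> float:
    """Hypothetical stand-in for step (2): the fraction of generated
    instructions that a disassembler accepted as valid."""
    flags = list(decoded_ok)
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


def coverage_reward(covered_before: Set[int],
                    covered_by_test: Set[int],
                    total_points: int) -> float:
    """Hypothetical stand-in for step (3): the share of all coverage
    points that this test hit for the first time."""
    if total_points <= 0:
        raise ValueError("total_points must be positive")
    newly_covered = covered_by_test - covered_before
    return len(newly_covered) / total_points


# Assumed example data: one test with four instructions, one of them invalid,
# and a design with 10 coverage points, 3 of which were already covered.
step2 = validity_reward([True, True, False, True])
seen = {0, 1, 2}
step3 = coverage_reward(seen, {2, 3, 4}, total_points=10)
seen |= {2, 3, 4}  # later tests are only rewarded for points beyond these
```

Rewarding only newly covered points, rather than all points a test hits, is one simple way to push a generator toward unexplored parts of the design. Whether ChatFuzz uses this exact formulation is an assumption here.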