μRL: Discovering Transient Execution Vulnerabilities Using Reinforcement Learning

M. Caner Tol, Kemal Derya, Berk Sunar

TL;DR

μRL introduces a reinforcement learning framework to autonomously discover microarchitectural vulnerabilities by directing instruction-space exploration with feedback from CPU performance counters. Using a PPO-based agent and a hierarchical action space, it learns to generate instruction sequences that reveal transient execution leaks, validated on Intel Skylake-X and Raptor Lake where new mechanisms such as masked FP exceptions and MMX-x87 transitions were found. The approach re-discovers known attacks and identifies eight novel transient execution patterns, with proof-of-concept exploitability demonstrated for Meltdown-style leakage without μcode assists or TSX. The results highlight the potential of adaptive, data-driven hardware-security testing to scale across architectures and potentially extend to GPUs and accelerators, enabling earlier, automated vulnerability discovery. Limitations include partial observability, sparse rewards, and the need for broader OS/mitigation context, but transfer learning and parallelism offer paths to broader applicability and scalability.

Abstract

We propose using reinforcement learning to address the challenges of discovering microarchitectural vulnerabilities, such as Spectre and Meltdown, which exploit subtle interactions in modern processors. Traditional methods like random fuzzing fail to efficiently explore the vast instruction space and often miss vulnerabilities that manifest under specific conditions. To overcome this, we introduce an intelligent, feedback-driven approach using RL. Our RL agents interact with the processor, learning from real-time feedback to prioritize instruction sequences more likely to reveal vulnerabilities, significantly improving the efficiency of the discovery process. We also demonstrate that RL systems adapt effectively to various microarchitectures, providing a scalable solution across processor generations. By automating the exploration process, we reduce the need for human intervention, enabling continuous learning that uncovers hidden vulnerabilities. Additionally, our approach detects subtle signals, such as timing anomalies or unusual cache behavior, that may indicate microarchitectural weaknesses. This proposal advances hardware security testing by introducing a more efficient, adaptive, and systematic framework for protecting modern processors. When unleashed on Intel Skylake-X and Raptor Lake microarchitectures, our RL agent was indeed able to generate instruction sequences that cause significant observable byte leakages through transient execution without generating any μcode assists, faults or interrupts. The newly identified leaky sequences stem from a variety of Intel instructions, including SERIALIZE, VERR/VERW, CLMUL, MMX-x87 transitions, LSL+RDTSCP and LAR. These initial results give credence to the proposed approach.
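The feedback loop summarized above (a two-level action choice, a reward derived from counter readings, and a policy that shifts toward rewarded choices) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the instruction groups, the `toy_reward` shape, and the synthetic `fake_counters` function are hypothetical, and a crude weight update stands in for PPO. It is not the authors' implementation and does not touch real hardware.

```python
import random

# Hypothetical instruction groups for illustration; not the paper's action space.
INSTRUCTION_GROUPS = {
    "serializing": ["SERIALIZE", "CPUID"],
    "segment": ["VERR", "VERW", "LSL", "LAR"],
    "vector": ["PCLMULQDQ", "EMMS"],
}


def sample_action(rng, group_weights):
    """Two-level choice: pick a group by weight, then an instruction uniformly."""
    groups = list(INSTRUCTION_GROUPS)
    weights = [group_weights[g] for g in groups]
    group = rng.choices(groups, weights=weights, k=1)[0]
    return group, rng.choice(INSTRUCTION_GROUPS[group])


def toy_reward(counters):
    """Assumed reward shape: favor issued-but-not-retired uops, penalize faults."""
    transient_uops = max(0, counters["uops_issued"] - counters["uops_retired"])
    return transient_uops - 10 * counters["faults"]


def fake_counters(rng, sequence):
    """Stand-in for real performance-counter reads; the numbers are synthetic."""
    retired = 4 * len(sequence)
    bonus = sum(3 for _, instr in sequence if instr in ("VERW", "LSL"))
    return {
        "uops_issued": retired + bonus + rng.randint(0, 1),
        "uops_retired": retired,
        "faults": 0,
    }


def explore(episodes=200, seq_len=4, lr=0.05, seed=0):
    """Run the toy loop and return the learned group weights."""
    rng = random.Random(seed)
    weights = {g: 1.0 for g in INSTRUCTION_GROUPS}
    for _ in range(episodes):
        sequence = [sample_action(rng, weights) for _ in range(seq_len)]
        reward = toy_reward(fake_counters(rng, sequence))
        # Crude preference update standing in for PPO:
        # reinforce every group that appeared in a rewarded sequence.
        for group, _ in sequence:
            weights[group] += lr * reward
    return weights


if __name__ == "__main__":
    print(explore())
```

On real hardware, `fake_counters` would be replaced by executing the generated sequence and reading performance counters, and the weight update by an actual policy-gradient method; the sketch only shows how the pieces connect.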

Paper Structure

This paper contains 48 sections, 5 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Overview of the RL framework for $\mu$Arch vulnerability analysis; $\bigoplus$ before the Observation denotes concatenation. $\bigoplus$ before Reward represents Equation \ref{eq:reward}.
  • Figure 2: Test flow for detecting observable byte leakage.
  • Figure 3: Experiment Setup on Skylake-X. Each physical core shown in green was allocated to a single instruction sequence at a time.
  • Figure 4: The increase of the average reward per episode during the 10 days of agent training in Raptor Lake. An episode corresponds to the largest instruction sequence the agent can generate from scratch. The darker line shows the running mean.
  • Figure 5: The increase of the average length of generated assembly sequences during the $\sim$10 days of agent training in Raptor Lake. The darker line shows the running mean.
  • ...and 13 more figures