Large Language Model Integration with Reinforcement Learning to Augment Decision-Making in Autonomous Cyber Operations

Konur Tholl, François Rivest, Mariam El Mezouar, Adrian Taylor, Ranwa Al Mallah

TL;DR

By guiding initial training with an LLM, this study improves baseline performance and reduces the need for exploratory actions with obviously negative outcomes. The guided agent achieves over 2x higher rewards during early training and converges to a favorable policy approximately 4,500 episodes faster than the baseline.

Abstract

Reinforcement Learning (RL) has shown great potential for autonomous decision-making in the cybersecurity domain, enabling agents to learn through direct environment interaction. However, RL agents in Autonomous Cyber Operations (ACO) typically learn from scratch, requiring them to execute undesirable actions to learn their consequences. In this study, we integrate external knowledge in the form of a Large Language Model (LLM) pretrained on cybersecurity data that our RL agent can directly leverage to make informed decisions. By guiding initial training with an LLM, we improve baseline performance and reduce the need for exploratory actions with obviously negative outcomes. We evaluate our LLM-integrated approach in a simulated cybersecurity environment, and demonstrate that our guided agent achieves over 2x higher rewards during early training and converges to a favorable policy approximately 4,500 episodes faster than the baseline.

Paper Structure

This paper contains 34 sections, 8 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Initial prompt design for evaluating LLMs. For clarity, the components of the prompt are color-coded.
  • Figure 2: Overview of the process used to select an LLM. Step 1 involved creating a dataset of questions and corresponding answers to evaluate the LLMs. In Step 2, the LLMs' predictions were recorded and evaluated against this dataset using BERTScore to compute precision, recall, and F1 scores. In Step 3, the results from Step 2 were manually reviewed, and the best-performing LLM was selected.
  • Figure 3: Overview of transforming CybORG's raw state into a coherent prompt, generating a response with the LLM, and extracting the corresponding action.
  • Figure 4: Diagram illustrating the integration of the LLM into the RL pipeline. The LLM's guidance is applied using action masking at inference and as an auxiliary loss signal during training. Frozen indicates that the LLM's parameters remain unchanged throughout training. To keep the diagram concise, the critic network is omitted, and some terms are presented in an abbreviated form.
  • Figure 5: Comparison of feature space modification combinations against the PPO baseline across 10 independent runs. Shaded regions represent ±1 standard error (SE), computed on the running average.
  • ...and 6 more figures
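
The Figure 4 caption describes two ways the LLM's guidance enters the RL pipeline: action masking at inference and an auxiliary loss signal during training. The sketch below illustrates both ideas in plain Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`masked_policy`, `auxiliary_loss`, `total_loss`), the use of cross-entropy as the auxiliary term, the `aux_weight` coefficient, and the fallback when no action is allowed are all hypothetical choices that this summary does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_policy(logits, allowed):
    """Action masking: give disallowed actions zero probability.

    `allowed` is a list of booleans, assumed here to be derived from
    the frozen LLM's suggestion (hypothetical interface).
    """
    if not any(allowed):
        # Assumption: fall back to the unmasked policy if the mask is empty.
        return softmax(logits)
    masked = [x if ok else float("-inf") for x, ok in zip(logits, allowed)]
    return softmax(masked)


def auxiliary_loss(logits, llm_action):
    """Assumed auxiliary term: cross-entropy between the policy
    distribution and the action index suggested by the LLM."""
    probs = softmax(logits)
    return -math.log(probs[llm_action])


def total_loss(ppo_loss, logits, llm_action, aux_weight):
    """Combine the usual PPO loss with the weighted auxiliary term."""
    return ppo_loss + aux_weight * auxiliary_loss(logits, llm_action)


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    # Inference: the mask removes action 1 from consideration.
    print(masked_policy(logits, [True, False, True]))
    # Training: the policy is nudged toward the LLM's suggested action 2.
    print(total_loss(ppo_loss=0.5, logits=logits, llm_action=2, aux_weight=0.1))
```

Masking only changes which actions can be sampled, so it leaves the policy network's parameters untouched. The auxiliary term, by contrast, shapes the gradients during training; a coefficient such as `aux_weight` could be reduced over time so the agent relies less on the LLM as training progresses, although whether the paper does so is not stated in this summary.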