Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense

Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson

TL;DR

The paper addresses reward design challenges for DRL-based autonomous cyber defense in complex, dynamic networks. It introduces an LLM-based reward design framework, implemented with Claude Sonnet 4, that uses environment context and attacker/defender personas to generate task-specific rewards. The framework is integrated into a PPO-based training loop in the Cyberwheel simulator, producing a composite reward that combines blue and red agent incentives. Results show that LLM-guided reward designs enable effective defense policies: proactive blue agents delay first impact against diverse attacker personas, and a mixed-strategy approach offers robustness and explainability.

Abstract

Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.

Paper Structure

This paper contains 9 sections, 5 equations, 8 figures, 3 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Methodological overview with LLM for reward design in a cyber DRL simulation environment.
  • Figure 2: Example state space in a Cyberwheel 15-host network with one router, three subnets, and multiple hosts.
  • Figure 3: Baseline red agent yaml file with actions and rewards.
  • Figure 4: Training progress of blue agent for different scenarios against a baseline red agent.
  • Figure 5: A single episode of red agent's action steps (top) and blue agent actions. For each step in the red agent's trajectory the lower symbol corresponds to the subnet where the action originates (its source) and the upper symbol corresponds to the subnet where the action takes place (its destination).
  • ...and 3 more figures