Discovering Command and Control Channels Using Reinforcement Learning

Cheng Wang, Akshay Kakkar, Christopher Redino, Abdul Rahman, Ajinsyam S, Ryan Clark, Daniel Radke, Tyler Cody, Lanxiao Huang, Edward Bowen

TL;DR

This work tackles the problem of identifying potential C2 channels by framing C2 traffic as a three-stage reinforcement learning problem (infection, connection, exfiltration) and modeling defender dynamics, particularly firewalls, within an MDP. The authors implement a PPO-based RL agent that operates on a detailed state space and action set, optimizing a reward structure that balances successful exfiltration against detection costs. The approach is validated on a large synthetic network (over 100 subnets and 1400+ hosts), where the agent learns efficient attack paths while evading firewall defenses, achieving substantial exfiltration within reasonable time frames. The findings demonstrate the practical utility of RL for red/purple team planning and for blue teams to anticipate and mitigate high-risk C2 pathways, with future work exploring different exfiltration protocols and more advanced defense-in-depth models.

Abstract

Command and control (C2) paths for issuing commands to malware are sometimes the only indicators of its existence within networks. Identifying potential C2 channels is often a manually driven process that requires a deep understanding of cyber tradecraft. A reinforcement learning (RL) based approach that learns to automatically carry out C2 attack campaigns on large networks with multiple defense layers in place can make the discovery of these channels more efficient for network operators. In this paper, we model C2 traffic flow as a three-stage process and formulate it as a Markov decision process (MDP) with the objective of maximizing the number of valuable hosts whose data is exfiltrated. The approach also explicitly models payload and defense mechanisms such as firewalls, which is a novel contribution. The attack paths learned by the RL agent can in turn help the blue team identify high-priority vulnerabilities and develop improved defense strategies. The method is evaluated on a large network with more than a thousand hosts, and the results demonstrate that the agent can effectively learn attack paths while avoiding firewalls.
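
To make the three-stage MDP framing concrete, here is a minimal toy sketch in Python. It is not the authors' environment: the class `ToyC2Env`, the helper `rollout`, the per-host stage encoding, and the reward constants (`step_cost`, `detection_cost`) are all assumptions made for illustration. The sketch only shows the general shape of the idea: each host moves through infection, connection, and exfiltration, exfiltrating a valuable host earns reward, and acting on a firewalled host costs reward.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Per-host progress through the three-stage C2 process."""
    CLEAN = 0
    INFECTED = 1
    CONNECTED = 2
    EXFILTRATED = 3


class ToyC2Env:
    """Toy MDP: one action advances one host by one stage (illustrative only)."""

    def __init__(self, value, firewalled, step_cost=1.0, detection_cost=10.0):
        self.value = value              # host -> value of its data (assumed)
        self.firewalled = firewalled    # hosts behind a firewall (assumed)
        self.step_cost = step_cost      # hypothetical cost per action
        self.detection_cost = detection_cost  # hypothetical firewall penalty
        self.stage = {}
        self.reset()

    def reset(self):
        self.stage = {host: Stage.CLEAN for host in self.value}
        return dict(self.stage)

    def step(self, host):
        """Advance `host` by one stage; return (state, reward, done)."""
        reward = -self.step_cost
        current = self.stage[host]
        if current < Stage.EXFILTRATED:
            if host in self.firewalled:
                # Traffic to a firewalled host is penalized on every action.
                reward -= self.detection_cost
            self.stage[host] = Stage(current + 1)
            if self.stage[host] == Stage.EXFILTRATED:
                reward += self.value[host]
        done = all(s == Stage.EXFILTRATED for s in self.stage.values())
        return dict(self.stage), reward, done


def rollout(env, hosts):
    """Reset the environment, apply one action per listed host, sum rewards."""
    env.reset()
    total = 0.0
    for host in hosts:
        _, reward, _ = env.step(host)
        total += reward
    return total


env = ToyC2Env(value={"a": 20.0, "b": 5.0}, firewalled={"b"})
print(rollout(env, ["a", "a", "a"]))  # unprotected, valuable host
print(rollout(env, ["b", "b", "b"]))  # firewalled, low-value host
```

With these made-up numbers, exfiltrating the unprotected host yields a positive return and the firewalled one a negative return, which is the kind of trade-off a policy-gradient agent such as PPO would be trained to exploit. The state space, action set, and reward structure used in the paper are far more detailed than this.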

Paper Structure (18 sections, 7 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Command and control attack as a three-stage process.
  • Figure 2: An example network with firewalls.
  • Figure 3: Average episode rewards over 5 runs.
  • Figure 4: Average episode length over 5 runs.
  • Figure 5: Times of upload actions taken during a C2 attack.
  • ...and 1 more figure