
Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense

Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung, Mohammad Hamdaqa

Abstract

Software-Defined Networking (SDN) is increasingly adopted to secure Internet-of-Things (IoT) networks due to its centralized control and programmable forwarding. However, SDN-IoT defense is inherently a closed-loop control problem in which mitigation actions impact controller workload, queue dynamics, rule-installation delay, and future traffic observations. Aggressive mitigation may destabilize the control plane, degrade Quality of Service (QoS), and amplify systemic risk. Existing learning-based approaches prioritize detection accuracy while neglecting controller coupling, and apply short-horizon Reinforcement Learning (RL) optimization without structured, auditable policy evolution. This paper introduces a self-reflective two-timescale SDN-IoT defense solution that separates fast mitigation from slow policy governance. At the fast timescale, per-switch Proximal Policy Optimization (PPO) agents perform controller-aware mitigation under safety constraints and action masking. At the slow timescale, a multi-agent Large Language Model (LLM) governance engine generates machine-parsable updates to the global policy constitution $\Pi$, which encodes admissible actions, safety thresholds, and reward priorities. Updates ($\Delta\Pi$) are validated through stress testing and deployed only with non-regression and safety guarantees, ensuring auditable evolution without retraining the RL agents. Evaluation under heterogeneous IoT traffic and adversarial stress shows Macro-F1 improvements of 9.1% over PPO and 15.4% over static baselines. Worst-case degradation drops by 36.8%, peak controller backlog by 42.7%, and RTT p95 inflation remains below 5.8% under high-intensity attacks. Policy evolution converges within five cycles, reducing catastrophic overload from 11.6% to 2.3%.
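To make the slow-timescale loop concrete, here is a minimal Python sketch of the $\Delta\Pi$ gating described above. Everything in it is an illustrative assumption: the names `PolicyConstitution`, `apply_delta`, and `validate_and_deploy`, and the specific fields, are hypothetical stand-ins for the paper's constitution $\Pi$, not its actual implementation.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PolicyConstitution:
    """Hypothetical encoding of the policy constitution Pi:
    admissible actions, safety thresholds, and reward priorities."""
    admissible_actions: frozenset[str]   # action names PPO agents may select
    max_backlog: float                   # controller-backlog safety threshold
    reward_weights: dict[str, float]     # e.g. {"mitigation": 0.6, "qos": 0.4}

def apply_delta(pi: PolicyConstitution, delta: dict) -> PolicyConstitution:
    """Apply a machine-parsable update (Delta Pi) to obtain a candidate Pi'."""
    return replace(pi, **delta)

def validate_and_deploy(pi, delta, stress_test, baseline_f1):
    """Gate a Delta Pi behind stress testing: deploy only with
    non-regression (score) and safety (backlog) guarantees."""
    candidate = apply_delta(pi, delta)
    f1, peak_backlog = stress_test(candidate)   # adversarial replay / simulation
    if f1 >= baseline_f1 and peak_backlog <= candidate.max_backlog:
        return candidate, True   # accepted: log (delta, f1, backlog) for audit
    return pi, False             # rejected: current constitution stays in force
```

The one property this sketch deliberately mirrors from the abstract is that a rejected $\Delta\Pi$ leaves the running constitution untouched, so the fast-timescale PPO agents never need retraining when a governance proposal fails validation.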

Paper Structure

This paper contains 26 sections, 40 equations, 17 figures, 10 tables, and 2 algorithms.

Figures (17)

  • Figure 1: Architecture of the proposed self-reflective SDN-IoT control solution. Independent per-switch PPO agents enable fast, decentralized mitigation, while the centralized controller mediates shared-control-plane dynamics and enforces safety constraints. A multi-agent LLM governance engine (Critic, Compiler, Red-Team, Judge, and Memory roles) operates on a slower timescale to generate auditable policy updates that modify the global policy entity $\Pi$ and guide the safe evolution of the system.
  • Figure 2: Experimental architecture of the self-reflective SDN-IoT network. Local RL agents run on SDN switches; the controller aggregates telemetry and enforces flow rules; and the multi-agent LLM engine performs structured policy evolution based on controller logs.
  • Figure 3: Security stabilization dynamics under the Fast configuration (PPO clip $\epsilon=0.30$). Steps per episode indicate sustained adversarial containment without control-plane overload.
  • Figure 4: Security stabilization dynamics under the Best Performance configuration (PPO clip $\epsilon=0.20$). The highest asymptotic episode length reflects extended safe containment under adversarial traffic while preserving controller stability.
  • Figure 5: Security stabilization dynamics under the Stable configuration (PPO clip $\epsilon=0.10$). Reduced cross-run variance indicates robust mitigation behavior under stochastic traffic with bounded control-plane impact.
  • ...and 12 more figures
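Figures 3–5 vary only the PPO clip parameter ($\epsilon \in \{0.10, 0.20, 0.30\}$). For readers unfamiliar with this knob, the quantity being tuned is the standard PPO clipped surrogate objective (from the PPO literature, not a formula specific to this paper):

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

A smaller $\epsilon$ bounds the per-update policy ratio $r_t(\theta)$ more tightly, which is consistent with the captions' reading: $\epsilon=0.10$ yields the lowest cross-run variance (Stable), while $\epsilon=0.30$ permits larger, faster policy shifts (Fast).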