L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)
Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Jun Wang, Yan Li, Chang Liu
TL;DR
Addressing the need for context-aware ICS security, this paper proposes L2M-AID, a hierarchical system that fuses LLM-based semantic reasoning with cooperative, MAPPO-based MARL. It formalizes defense as a Dec-POMDP whose semantically enriched state representations are generated by an Orchestrator LLM, and it optimizes agent policies with MAPPO under centralized training and decentralized execution, using a global reward $\mathcal{R}(s,\mathbf{a})$ that balances security and process safety. Validation on the SWaT ICS benchmark and on a synthetic dataset derived from the MITRE ATT&CK for ICS framework shows higher detection rates, fewer false positives, faster responses, and improved process stability, with ablations confirming the critical role of the semantic embedding and multi-agent coordination. The work demonstrates a robust, autonomous defense paradigm for protecting critical infrastructure and provides a foundation for future improvements in sim-to-real transfer, adversarial resilience, and explainability.
Abstract
The increasing integration of the Industrial IoT (IIoT) exposes critical cyber-physical systems to sophisticated, multi-stage attacks that elude traditional defenses lacking contextual awareness. This paper introduces L2M-AID, a novel framework for Autonomous Industrial Defense using LLM-empowered, Multi-agent reinforcement learning. L2M-AID orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. The core innovation lies in the deep fusion of two AI paradigms: we use an LLM as a semantic bridge that translates vast, unstructured telemetry into a rich, contextual state representation, enabling agents to reason about adversary intent rather than merely match patterns. This semantically aware state allows a Multi-Agent Reinforcement Learning (MARL) algorithm, MAPPO, to learn complex cooperative strategies. The MARL reward function is designed to balance security objectives (threat neutralization) with operational imperatives, explicitly penalizing actions that disrupt physical process stability. To validate our approach, we conduct extensive experiments on the benchmark SWaT dataset and on a novel synthetic dataset generated from the MITRE ATT&CK for ICS framework. Results demonstrate that L2M-AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single-agent RL baselines across key metrics, achieving a 97.2% detection rate while reducing false positives by over 80% and improving response times by a factor of four. Crucially, it also maintains physical process stability better than these baselines, presenting a robust new paradigm for securing critical national infrastructure.
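To illustrate the reward design described above, the following is a minimal Python sketch of a shared team reward that trades threat neutralization against false positives and process disruption. It is not the paper's actual formulation: the function name `global_reward`, the `RewardWeights` fields, the weight values, and the scalar `process_deviation` measure are all hypothetical, chosen only to show the shape of such a reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the abstract does not state the actual values."""
    threat_neutralized: float = 1.0   # gain per neutralized threat
    false_positive: float = 0.5       # cost per benign event acted upon
    process_deviation: float = 2.0    # cost per unit of process disruption


def global_reward(
    threats_neutralized: int,
    false_positives: int,
    process_deviation: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Shared team reward: security gain minus operational penalties.

    `process_deviation` is an assumed non-negative scalar, for example a
    normalized distance of a tank level from its setpoint.
    """
    if process_deviation < 0:
        raise ValueError("process_deviation must be non-negative")

    security_gain = weights.threat_neutralized * threats_neutralized
    false_positive_cost = weights.false_positive * false_positives
    stability_cost = weights.process_deviation * process_deviation
    return security_gain - false_positive_cost - stability_cost


# Two ways of stopping the same attack: a targeted isolation that barely
# disturbs the process, and a blunt shutdown that disturbs it heavily.
targeted = global_reward(threats_neutralized=1, false_positives=0, process_deviation=0.1)
blunt = global_reward(threats_neutralized=1, false_positives=0, process_deviation=0.8)
```

Under these assumed weights, the targeted response scores higher than the blunt one even though both neutralize the threat, which is the behavior a stability-penalizing reward is meant to encourage.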
