LLM-based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks
Seif Ikbarieh, Maanak Gupta, Elmahedi Mahalal
TL;DR
The paper tackles the lack of quantitative benchmarks for AI-driven IoT security analyses by proposing a hybrid framework that combines ML-based attack detection with LLM-driven attack behavior analysis and mitigation suggestions. It benchmarks multiple ML/DL classifiers on Edge-IIoTset and CICIoT2023 to choose the best detector, Random Forest (RF), and uses Retrieval-Augmented Generation with prompt engineering to ground LLM analyses in attack and device context. An ensemble of judge LLMs plus human experts provides objective scoring across four evaluation dimensions, enabling quantitative comparison of LLMs such as ChatGPT-o3 and DeepSeek-R1. The results show that RF offers superior detection performance and that ChatGPT-o3 provides more accurate and practical analyses and mitigations across 13 attack types, highlighting the potential for scalable, grounded, AI-assisted IoT security.
Abstract
The Internet of Things has expanded rapidly, transforming communication and operations across industries, but it has also enlarged the attack surface and led to more security breaches. Artificial Intelligence plays a key role in securing IoT, enabling attack detection, attack behavior analysis, and mitigation suggestion. Despite these advancements, evaluations of AI-based attack analysis and mitigation remain purely qualitative, and the lack of a standardized, objective benchmark for measuring them quantitatively hinders consistent assessment of model effectiveness. In this work, we propose a hybrid framework combining Machine Learning (ML) for multi-class attack detection with Large Language Models (LLMs) for attack behavior analysis and mitigation suggestion. After benchmarking several ML and Deep Learning (DL) classifiers on the Edge-IIoTset and CICIoT2023 datasets, we apply structured role-play prompt engineering with Retrieval-Augmented Generation (RAG) to guide ChatGPT-o3 and DeepSeek-R1 in producing detailed, context-aware responses. We introduce novel evaluation metrics for quantitative assessment, which guide both us and an ensemble of judge LLMs, namely ChatGPT-4o, DeepSeek-V3, Mixtral 8x7B Instruct, Gemini 2.5 Flash, Meta Llama 4, TII Falcon H1 34B Instruct, xAI Grok 3, and Claude 4 Sonnet, in independently evaluating the responses. Results show that Random Forest is the best detection model and that ChatGPT-o3 outperforms DeepSeek-R1 in attack analysis and mitigation.
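As a rough illustration of how scores from an ensemble of judge LLMs might be combined into a single quantitative result, the minimal Python sketch below averages each evaluation dimension across judges and then averages the dimensions. This is an assumption for illustration, not the paper's actual scoring procedure: the dimension names (`dim_1` to `dim_4`), the judge labels, the function name `aggregate_judge_scores`, the unweighted averaging, and all numeric scores are hypothetical placeholders.

```python
from statistics import mean

# Hypothetical placeholder names: the source states only that there are
# four evaluation dimensions, not what they are called.
DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4")


def aggregate_judge_scores(scores_by_judge):
    """Average each dimension across judges, then average the dimensions.

    scores_by_judge maps a judge label to a dict of dimension -> score.
    Unweighted averaging is an assumption made for this sketch.
    """
    per_dimension = {
        dim: mean(judge_scores[dim] for judge_scores in scores_by_judge.values())
        for dim in DIMENSIONS
    }
    overall = mean(per_dimension.values())
    return per_dimension, overall


# Made-up scores for illustration only; these are not results from the paper.
example = {
    "judge_a": {"dim_1": 8, "dim_2": 7, "dim_3": 9, "dim_4": 6},
    "judge_b": {"dim_1": 6, "dim_2": 7, "dim_3": 7, "dim_4": 8},
}
per_dimension, overall = aggregate_judge_scores(example)
print(per_dimension, overall)
```

Running the same aggregation once per candidate LLM would give per-dimension and overall scores that can be compared side by side. A real evaluation might weight judges or dimensions differently.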
