Labeling NIDS Rules with MITRE ATT&CK Techniques: Machine Learning vs. Large Language Models

Nir Daniel; Florian Klaus Kaiser; Shay Giladi; Sapir Sharabi; Raz Moyal; Shalev Shpolyansky; Andres Murillo; Aviad Elyashar; Rami Puzis

TL;DR

This work benchmarks labeling NIDS rules with MITRE ATT&CK techniques by comparing three prominent LLMs (ChatGPT, Claude, Gemini) against ML models trained on TF-IDF features, using a dataset of 973 Snort rules mapped to 75 ATT&CK techniques. It formalizes the task as conditional text generation for LLMs and a supervised multi-label classification pipeline for ML, with a balanced 80/20 train/test split. Across configurations, ML models achieve higher precision, recall, and F1 than LLMs (e.g., technique F1 up to 0.87 and tactic F1 up to 0.92), while LLMs provide explainable mappings and flexible reasoning, especially when prompt templates combine contextual and example-driven guidance (T-$ICL_2$). The study also delivers a resource: a labeled NIDS rule dataset for future benchmarking, and argues for hybrid LLM-ML approaches to harness explainability and high accuracy for SOC workflows and evolving threat landscapes.
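Because each rule may carry several techniques, precision, recall, and F1 for this task are computed over label sets rather than single labels. The following is a minimal sketch of one common way to do that, assuming micro-averaging over per-rule sets of technique IDs; the averaging scheme actually used in the paper is not stated in this summary. The function name `micro_prf` and the example rules and labels are hypothetical and are not taken from the dataset.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-rule label sets.

    Assumption: micro-averaging (pooling true/false positives across all
    rules); the paper may report a different average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain one label set per rule")

    # Pool counts across all rules before computing the ratios.
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but not predicted

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for three rules
# (illustrative ATT&CK technique IDs, not drawn from the paper's data).
gold = [{"T1190"}, {"T1059", "T1105"}, {"T1071"}]
pred = [{"T1190"}, {"T1059"}, {"T1071", "T1095"}]

p, r, f = micro_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints: precision=0.75 recall=0.75 f1=0.75
```

The same function can score either an ML classifier or an LLM, provided each model's output is first normalized to a set of technique IDs per rule.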

Abstract

Analysts in Security Operations Centers (SOCs) are often occupied with time-consuming investigations of alerts from Network Intrusion Detection Systems (NIDS). Many NIDS rules lack clear explanations and associations with attack techniques, complicating alert triage and the generation of attack hypotheses. Large Language Models (LLMs) may be a promising technology to reduce the alert explainability gap by associating rules with attack techniques. In this paper, we investigate the ability of three prominent LLMs (ChatGPT, Claude, and Gemini) to reason about NIDS rules while labeling them with MITRE ATT&CK tactics and techniques. We discuss prompt design and present experiments performed with 973 Snort rules. Our results indicate that while LLMs provide explainable, scalable, and efficient initial mappings, traditional Machine Learning (ML) models consistently outperform them in accuracy, achieving higher precision, recall, and F1-scores. These results highlight the potential for hybrid LLM-ML approaches to enhance SOC operations and better address the evolving threat landscape.
