IT Intrusion Detection Using Statistical Learning and Testbed Measurements

Xiaoxuan Wang; Rolf Stadler

TL;DR

The paper investigates automated IT intrusion detection by comparing HMM, LSTM, and RFC for predicting attack start times, attack types, and action sequences using data from a realistic testbed. It introduces a machine-learning pipeline to reduce high-dimensional observations to a manageable symbol/feature set, enabling offline and online evaluation, including scenarios with unlabeled data. Results show HMM and LSTM can achieve high start-time and type accuracy, with LSTM excelling at action-sequence prediction given sufficient labeled data, while HMM offers lighter training and online applicability; RFC provides fast single-step predictions but weaker sequence performance. The study demonstrates that incorporating SNORT-based signals and careful feature reduction enhances predictive performance, and it outlines practical directions for integrating ML-based intrusion prediction with traditional IDS such as SNORT.

Abstract

We study automated intrusion detection in an IT infrastructure, specifically the problem of identifying the start of an attack, the type of attack, and the sequence of actions an attacker takes, based on continuous measurements from the infrastructure. We apply statistical learning methods, including Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), and Random Forest Classifier (RFC), to map sequences of observations to sequences of predicted attack actions. In contrast to most related research, we have abundant data to train the models and evaluate their predictive power. The data comes from traces we generate on an in-house testbed where we run attacks against an emulated IT infrastructure. Central to our work is a machine-learning pipeline that maps measurements from a high-dimensional observation space to a space of low dimensionality or to a small set of observation symbols. Investigating intrusions in offline as well as online scenarios, we find that both HMM and LSTM can be effective in predicting attack start time, attack type, and attack actions. If sufficient training data is available, LSTM achieves higher prediction accuracy than HMM. HMM, on the other hand, requires fewer computational resources and less training data for effective prediction. We also find that the methods we study benefit from data produced by traditional intrusion detection systems like SNORT.
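
As a rough illustration of how an HMM can map a sequence of observation symbols to a sequence of attack actions, the following is a minimal sketch of Viterbi decoding in Python with NumPy. It is not the authors' implementation, and this summary does not state which decoding procedure the paper uses. The three states, the three observation symbols, and all probabilities are hypothetical values chosen only to make the example run.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for a symbol sequence."""
    n_states = trans_p.shape[0]
    T = len(obs)
    # Work in log space to avoid underflow on long sequences;
    # zero probabilities become -inf, which argmax handles correctly.
    with np.errstate(divide="ignore"):
        log_start = np.log(start_p)
        log_trans = np.log(trans_p)
        log_emit = np.log(emit_p)
    score = np.empty((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        # cand[i, j]: best log-score of being in state i at t-1, then moving to j
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_emit[:, obs[t]]
    # Follow the back-pointers from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical states: 0 = no attack, 1 = scanning, 2 = exploit.
# Hypothetical symbols: 0 = quiet, 1 = many alerts, 2 = login failures.
start_p = np.array([1.0, 0.0, 0.0])
trans_p = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])
emit_p = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.80, 0.10],
                   [0.10, 0.10, 0.80]])

obs = [0, 0, 1, 1, 2, 2]
path = viterbi(obs, start_p, trans_p, emit_p)
# Predicted attack start: first time step decoded as a non-idle state.
attack_start = next(t for t, s in enumerate(path) if s != 0)
print(path, attack_start)
```

In this toy setting the decoded state sequence plays the role of the predicted attack actions, and the first non-idle state gives a predicted attack start time. In practice the transition and emission probabilities would be estimated from training traces rather than set by hand.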

Paper Structure (18 sections, 10 equations, 5 figures, 3 tables)

Introduction
Use Case
Problem formulation and approach
Methods
Hidden Markov Model (HMM)
Long Short-Term Memory (LSTM)
Random Forest Classifier (RFC)
Emulating attacks and collecting measurements
Data preprocessing
Evaluation results
Evaluation metrics
Model training and evaluation
Results for offline prediction (forensics)
Results for online prediction (real-time supervision)
Discussion of the evaluation results
...and 3 more sections

Figures (5)

Figure 1: The IT infrastructure in the intrusion detection use case.
Figure 2: ML pipeline of the proposed approach. Models we consider are HMM, LSTM, and RFC.
Figure 3: Structure of Hidden Markov Model.
Figure 4: Generating a sample sequence from a testbed trace.
Figure 5: Offline training, online prediction: prediction accuracy of the attack action at the current time step as a function of the observation sequence length. The results are for Attack 1.
