siForest: Detecting Network Anomalies with Set-Structured Isolation Forest

Christie Djidjev

TL;DR

This work tackles anomaly detection in set-structured network scan data by introducing siForest, a Set-Partitioned Isolation Forest that preserves IP-level groupings and uses IP-based aggregation for anomaly scoring. It evaluates siForest against standard iForest variants using two preprocessing strategies (flattening and summarization) on synthetic data reflecting realistic Censys-like scans. The results show that siForest performs robustly across anomaly types and excels where port-service relationships are crucial, while the choice of preprocessing influences performance for specific anomaly types. The study demonstrates siForest's potential as a practical tool for attack-surface identification in cybersecurity contexts, with future directions including real-world validation and integration with graph-based techniques.

Abstract

As cyber threats continue to evolve in sophistication and scale, the ability to detect anomalous network behavior has become critical for maintaining robust cybersecurity defenses. Modern cybersecurity systems face the overwhelming challenge of analyzing billions of daily network interactions to identify potential threats, making efficient and accurate anomaly detection algorithms crucial for network defense. This paper investigates the use of variations of the Isolation Forest (iForest) machine learning algorithm for detecting anomalies in internet scan data. In particular, it presents the Set-Partitioned Isolation Forest (siForest), a novel extension of the iForest method designed to detect anomalies in set-structured data. By treating instances such as sets of multiple network scans with the same IP address as cohesive units, siForest effectively addresses some challenges of analyzing complex, multidimensional datasets. Extensive experiments on synthetic datasets simulating diverse anomaly scenarios in network traffic demonstrate that siForest has the potential to outperform traditional approaches on some types of internet scan data.
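
To make the idea of treating all scans from one IP address as a unit more concrete, the following minimal sketch groups per-scan anomaly scores by IP and reduces each group to a single score. It is not the paper's algorithm: the record layout, the sample values, the function name `aggregate_scores_by_ip`, and the choice of the mean as the reducer are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores_by_ip(scans, scores, reducer=mean):
    """Group per-scan anomaly scores by IP and reduce each group to one score.

    `scans` is a list of dicts with an "ip" key (hypothetical layout);
    `scores` holds one per-scan score in the same order (higher = more anomalous).
    """
    grouped = defaultdict(list)
    for scan, score in zip(scans, scores):
        grouped[scan["ip"]].append(score)
    return {ip: reducer(values) for ip, values in grouped.items()}


# Hypothetical scan records and per-scan scores, invented for this example.
scans = [
    {"ip": "10.0.0.1", "port": 80, "service": "HTTP"},
    {"ip": "10.0.0.1", "port": 443, "service": "HTTPS"},
    {"ip": "10.0.0.2", "port": 22, "service": "SSH"},
    {"ip": "10.0.0.2", "port": 6667, "service": "HTTP"},
]
scores = [0.25, 0.25, 0.5, 1.0]

ip_scores = aggregate_scores_by_ip(scans, scores)
most_anomalous_ip = max(ip_scores, key=ip_scores.get)
```

Passing a different `reducer` (for example `max`) changes how strongly a single unusual scan influences the score of its IP; which aggregation the paper uses is not stated in this summary.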

Paper Structure

This paper contains 14 sections and 6 figures.

Figures (6)

  • Figure 1: Illustration of how the distance to the root in an Isolation Forest tree is used to detect outliers. Shorter distances correspond to points that are easier to isolate and are therefore more likely to be anomalous.
  • Figure 2: Comparison of the number of features and rows across three different data representations: original, flattening, and summarization. The original data contains 10 IP addresses with 20 scans each, resulting in 200 rows and 3 features (IP, port list, and service list).
  • Figure 3: Comparison of Isolation Forest approaches showing datapoint types, features, and binary tree partitioning for the three methods considered in this paper.
  • Figure 4: Bars in red indicate anomalous IP scans with much higher usage compared to normal scans.
  • Figure 5: Red bars highlight anomalous behavior based on atypical port usage.
  • ...and 1 more figure
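
The intuition in the Figure 1 caption, that easily isolated points sit closer to the root, can be sketched with a toy one-dimensional example. This is a simplified illustration of random-split isolation, not the paper's implementation: the function names, the data, the seed, and the number of trees are assumptions chosen for the example.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as the point.
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=200, seed=0):
    """Average the isolation depth of `point` over several random trees."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Fifty clustered values in [0, 1) plus one distant value (hypothetical data).
data = [i / 50 for i in range(50)] + [10.0]

outlier_depth = average_path_length(10.0, data)
inlier_depth = average_path_length(0.5, data)
```

Here the distant value is usually separated by the first split, so its average depth is much smaller than that of a value inside the cluster, which is the signal an Isolation Forest turns into an anomaly score.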