Effectiveness of Tree-based Ensembles for Anomaly Discovery: Insights, Batch and Streaming Active Learning

Shubhomoy Das, Md Rakibul Islam, Nitthilan Kannappan Jayakodi, Janardhan Rao Doppa

TL;DR

This work investigates how tree-based ensembles can power efficient, human-in-the-loop anomaly discovery in both batch and streaming contexts. It reveals why averaging ensemble scores with a uniform prior and greedy top-score queries yield label-efficient identification of true anomalies, and formalizes this through HiLAD. The authors introduce Compact Description to describe and diversify discovered anomalies, and develop HiLAD-Batch and HiLAD-Stream to handle batch and streaming data with drift-detection and adaptive updates. Empirical results across ten datasets show significant gains over unsupervised baselines, improved diversity without loss of discovery rate, and competitive streaming performance, underscoring the practical value of these methods for real-world anomaly detection tasks.

Abstract

In many real-world anomaly detection (AD) applications, including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort spent on false positives. One important way to configure the detector is by providing true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the weights of ensemble detectors based on label feedback allows us to quickly discover true anomalies. This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical successes of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support our insights and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm to improve the diversity of discovered anomalies, based on a formalism called compact description that describes the discovered anomalies. Third, we develop a novel active learning algorithm to handle the streaming data setting. We present a data drift detection algorithm that not only detects the drift robustly, but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms under the streaming-data setup are competitive with the batch setup.
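
The greedy query loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name `greedy_active_discovery`, the step size `lr`, and the toy score matrix are hypothetical, and the weight update is a simple additive step followed by renormalization, swapped in for whatever optimization the paper actually uses. The weights start uniform, so the first query is simply the instance with the highest average ensemble score.

```python
import math


def greedy_active_discovery(Z, y, budget, lr=0.5):
    """Greedy top-score querying with a simplified weight update.

    Z: list of rows; Z[i][j] is the score that ensemble member j gives
       instance i (higher means more anomalous).
    y: list of true labels, 1 = anomaly, 0 = nominal (stands in for the analyst).
    Returns (queried indices in order, final weight vector).
    """
    m = len(Z[0])
    # Uniform prior: every ensemble member starts with the same weight.
    w = [1.0 / math.sqrt(m)] * m
    unlabeled = set(range(len(Z)))
    queried = []

    def score(i):
        return sum(wj * zj for wj, zj in zip(w, Z[i]))

    for _ in range(min(budget, len(Z))):
        # Greedy selection: ask about the top-scoring unlabeled instance.
        i = max(sorted(unlabeled), key=score)
        unlabeled.remove(i)
        queried.append(i)
        # ASSUMPTION: illustrative update, not the paper's loss. Move the
        # weights toward a confirmed anomaly and away from a false positive.
        direction = 1.0 if y[i] == 1 else -1.0
        w = [wj + lr * direction * zj for wj, zj in zip(w, Z[i])]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > 0:
            w = [wj / norm for wj in w]
    return queried, w


# Toy data: two ensemble members, four instances, one true anomaly (index 1).
Z = [[0.9, 0.3], [0.2, 0.8], [0.1, 0.1], [0.9, 0.2]]
y = [0, 1, 0, 0]
queried, w = greedy_active_discovery(Z, y, budget=2)
print(queried)  # index 0 is queried first, then index 1
```

On this toy data, the nominal label on instance 0 lowers the weight of the first ensemble member, so the true anomaly (index 1) is reached on the second query. With `lr=0.0` (no feedback), the second query would instead be index 3.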

Paper Structure

This paper contains 23 sections, 2 theorems, 3 equations, 23 figures, 11 tables, 8 algorithms.

Key Result

Proposition 4.1

Let $\delta\in[0, 1]$. For the 2D case, the number of labels needed to learn the decision boundary with probability $(1-\delta)$ with pool-based active learning is $T = O(\log(\frac{1}{\sigma})\frac{1}{p_{\theta}}\log(\frac{1}{\delta}))$.

Figures (23)

  • Figure 1: High-level overview of the human-in-the-loop learning framework for anomaly detection. Our goal is to maximize the number of true anomalies presented to the analyst.
  • Figure 2: Illustration of an isolation tree, from (das:2017).
  • Figure 3: Illustration of an isolation tree on simple data. (a) Toy dataset (das:2017). (b) A single isolation tree for the Toy dataset. (c) Regions having deeper red belong to leaf nodes which have shorter path lengths from the root and, correspondingly, higher anomaly scores. Regions having deeper blue correspond to longer path lengths and lower anomaly scores.
  • Figure 4: Illustration of differences among different tree-based ensembles. The red rectangles show the union of the $5$ most anomalous subspaces across each of the $15$ most anomalous instances (blue). These subspaces have the highest influence in propagating feedback across instances through gradient-based learning under our model. HST has fixed depth, which needs to be high for accuracy (recommended $15$; tan:2011). IFOR has adaptive height, and its most anomalous subspaces are shallow. Higher depths are associated with smaller subspaces, which are shared by fewer instances. As a result, feedback on any individual instance gets passed on to many other instances in IFOR, but to fewer instances in HST. RSF has similar behavior to HST. We set the depth for HST (and RSF; wu:2014) to $8$ (Figure \ref{fig:hstrees_regions_8}) in our experiments in order to balance accuracy and feedback efficiency.
  • Figure 5: Illustration of candidate score distributions from an ensemble in 2D. The two axes represent two different ensemble members. (a) C1 represents the common case where both anomaly detectors score anomalous data points higher. (b) C2 illustrates how active learning helps the model learn the slight angle deviation, $\theta$. (c) C3 is specific to the IFOR case, where anomalous data points have smaller path lengths. Anomalous data points will therefore be located at the two extremes where the path lengths are smallest, and can be separated by the non-homogeneous hyperplane.
  • ...and 18 more figures
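
The relationship noted in the Figure 3 caption (shorter path length from the root, higher anomaly score) can be illustrated with a small sketch. It assumes the usual isolation-tree construction of a random feature and a uniformly random split value at each node; the helper names `build_tree`, `path_length`, and `mean_path_length`, the depth cap, and the toy data are hypothetical and are not taken from the paper.

```python
import random


def build_tree(points, rng, depth=0, max_depth=10):
    """Recursively partition points with random axis-aligned splits."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # leaf node
    f = rng.randrange(len(points[0]))  # pick a random feature
    lo = min(p[f] for p in points)
    hi = max(p[f] for p in points)
    if lo == hi:
        # Simplification: stop if the chosen feature cannot split the node.
        return {"size": len(points)}
    s = rng.uniform(lo, hi)  # pick a random split value in the feature's range
    left = [p for p in points if p[f] < s]
    right = [p for p in points if p[f] >= s]
    return {
        "feature": f,
        "split": s,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(tree, x):
    """Number of edges from the root to the leaf that contains x."""
    depth = 0
    while "feature" in tree:
        side = "left" if x[tree["feature"]] < tree["split"] else "right"
        tree = tree[side]
        depth += 1
    return depth


def mean_path_length(points, x, n_trees=100, seed=0):
    """Average path length of x over a small forest of random trees."""
    rng = random.Random(seed)
    total = sum(path_length(build_tree(points, rng), x) for _ in range(n_trees))
    return total / n_trees


# Toy data: a tight cluster of 20 points plus one far-away point.
cluster = [(0.05 * i, 0.05 * ((7 * i) % 20)) for i in range(20)]
outlier = (10.0, 10.0)
data = cluster + [outlier]
print(mean_path_length(data, outlier), mean_path_length(data, cluster[10]))
```

The isolated point is typically separated within the first split or two, while a point inside the cluster needs several more splits, which is why an ensemble that scores by path length ranks the isolated point as more anomalous.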

Theorems & Definitions (2)

  • Proposition 4.1
  • Proposition 4.2