A Mallows-like Criterion for Anomaly Detection with Random Forest Implementation

Gaoxiang Zhao, Lu Wang, Xiaoqiang Wang

TL;DR

This work addresses anomaly detection under model uncertainty and extreme class imbalance by proposing a Mallows-like focal loss (MFL) criterion to optimize ensemble weights $\boldsymbol{\omega}$ on the model-averaging simplex. It integrates the MFL criterion into Random Forest, training $M$ base trees and solving for $\boldsymbol{\omega}^*$ to form a weighted ensemble, with a complexity penalty reflecting base-model size $k_m$ and focal-loss parameters $(\alpha,\gamma)$. Empirical results on the KDDCup network intrusion dataset and ten imbalanced UCI datasets show that MFL-based RF improves AUC, ARI, and Recall relative to cross-entropy model averaging and several standard anomaly detectors, indicating improved accuracy and robustness. The approach advances anomaly detection by explicitly addressing data imbalance and model uncertainty through a principled, regularized, model-averaging framework and offers practical gains for cybersecurity and other domains.

Abstract

The effectiveness of anomaly signal detection can be significantly undermined by the inherent uncertainty of relying on a single specified model. Within the model averaging framework, this paper proposes a novel criterion for selecting the weights used to aggregate multiple models, in which a focal loss function accounts for the classification of extremely imbalanced data. This strategy is further integrated into the Random Forest algorithm by replacing the conventional voting method. We evaluate the proposed method on benchmark datasets across various domains, including network intrusion. The results indicate that the proposed method outperforms both model averaging with typical loss functions and common anomaly detection algorithms in terms of accuracy and robustness.
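
To make the weighting idea concrete, the following is a minimal sketch of choosing simplex weights for a handful of base models by minimizing a focal loss plus a size penalty. It is not the paper's implementation: the exact form of the MFL criterion is not stated in this summary, so the objective below (mean focal loss of the weighted-average probability plus `lam` times the weighted model size) is an assumed stand-in. The penalty constant `lam`, the toy probabilities, the leaf counts in `sizes`, and the exponentiated-gradient solver are all illustrative assumptions.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample; p is the predicted anomaly probability."""
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

def criterion(w, probs, y, sizes, lam=0.001, alpha=0.25, gamma=2.0):
    """Assumed Mallows-like objective (hypothetical form, not taken from the paper):
    mean focal loss of the weighted ensemble + lam * sum_m w_m * k_m."""
    n = len(y)
    total = 0.0
    for i in range(n):
        # Weighted-average probability replaces majority voting.
        p = sum(w[m] * probs[m][i] for m in range(len(w)))
        total += focal_loss(p, y[i], alpha, gamma)
    penalty = lam * sum(wm * km for wm, km in zip(w, sizes))
    return total / n + penalty

def fit_weights(probs, y, sizes, lam=0.001, steps=300, lr=2.0, h=1e-6):
    """Exponentiated-gradient descent with finite-difference gradients.
    The multiplicative update followed by renormalization keeps w on the simplex."""
    n_models = len(probs)
    w = [1.0 / n_models] * n_models
    for _ in range(steps):
        base = criterion(w, probs, y, sizes, lam)
        grad = []
        for m in range(n_models):
            w_shift = list(w)
            w_shift[m] += h
            grad.append((criterion(w_shift, probs, y, sizes, lam) - base) / h)
        w = [wm * math.exp(-lr * g) for wm, g in zip(w, grad)]
        norm = sum(w)
        w = [wm / norm for wm in w]
    return w

# Toy data (made up for illustration): 8 samples, 2 anomalies, 3 base models.
y = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [
    [0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.9, 0.8],  # well-separated scores
    [0.3, 0.4, 0.3, 0.4, 0.3, 0.4, 0.6, 0.5],  # weakly separated scores
    [0.6, 0.5, 0.7, 0.6, 0.5, 0.6, 0.3, 0.4],  # misleading scores
]
sizes = [8, 4, 16]  # k_m, e.g. number of leaves per tree (assumed)

w = fit_weights(probs, y, sizes)
print("weights:", [round(wm, 3) for wm in w])
print("criterion at fitted w:", criterion(w, probs, y, sizes))
```

In a real Random Forest setting, each row of `probs` would come from one fitted tree's predicted anomaly probabilities, and a constrained solver could replace the simple update used here.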

Paper Structure (6 sections, 7 equations, 2 figures, 2 tables, 1 algorithm)

Figures (2)

  • Figure 1: Schematic diagram of the proposed model averaging method. The approach minimizes the MFL criterion to allocate weights to the base decision trees, mitigating the effects of data imbalance while controlling model complexity.
  • Figure 2: Network intrusion dataset methodology metrics.