Robust Isolation Forest using Soft Sparse Random Projection and Valley Emphasis Method
Hun Kang, Kyoungok Kim
TL;DR
RiForest addresses the inconsistent performance of prior iForest variants by combining original features with soft sparse random projections to form a diverse hyperplane set, and by using the valley emphasis method to determine split points. The method also introduces dimension entropy-based feature selection and a variable path-length scheme to sharpen anomaly scores. Across 24 benchmark datasets, RiForest achieves strong AUROC with lower variability than the baselines and is robust to noisy variables; an ablation analysis confirms the value of its components, especially the valley-based split. The approach offers a practical, dataset-agnostic improvement for unsupervised anomaly detection in diverse domains, reducing sensitivity to noise and distributional differences.
Abstract
Isolation Forest (iForest) is an unsupervised anomaly detection algorithm built on the assumption that anomalies are "few and different." Various studies have sought to enhance iForest, but the resulting algorithms often show large performance disparities across datasets. In addition, research focused on improving splits has still struggled to isolate rare and widely distributed anomalies. To address these challenges, we introduce Robust iForest (RiForest). RiForest leverages both the original features and random hyperplanes obtained through soft sparse random projection to identify superior split features for anomaly detection, regardless of the dataset. It uses the underutilized valley emphasis method to determine optimal split points and randomizes the sparsity of the soft sparse random projection to make anomaly detection more robust. In experiments on 24 benchmark datasets, RiForest consistently outperforms existing algorithms in anomaly detection, with particular strengths in stability and robustness to noisy variables.
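To make the split-point idea concrete, the following is a minimal sketch of a valley-emphasis-style split on a single projected feature. It is not the authors' implementation: the function name `valley_emphasis_split`, the histogram bin count, and the use of Otsu's between-class variance in its shift-invariant form w0 * w1 * (mu0 - mu1)^2 are assumptions made for illustration. The only ingredient taken from the valley emphasis method itself is the weight (1 - p_t), which favors thresholds that fall in sparsely populated histogram bins.

```python
import numpy as np


def valley_emphasis_split(values, bins=32):
    """Pick a split point for 1-D values with a valley-emphasis criterion.

    Illustrative sketch only (hypothetical helper, not the paper's code).
    Score for candidate bin t:
        (1 - p_t) * w0 * w1 * (mu0 - mu1) ** 2
    where p_t is the fraction of points in bin t, and (w0, mu0), (w1, mu1)
    are the weights and means of the bins up to t and after t.
    """
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    p = counts / counts.sum()                  # bin probabilities
    centers = 0.5 * (edges[:-1] + edges[1:])   # bin midpoints

    best_score, best_t = -np.inf, None
    # Candidate t: bins 0..t form the left class, bins t+1.. the right class.
    for t in range(bins - 1):
        w0 = p[: t + 1].sum()
        w1 = 1.0 - w0
        if w0 <= 0.0 or w1 <= 0.0:
            continue  # one side is empty, so this is not a valid split
        mu0 = (p[: t + 1] * centers[: t + 1]).sum() / w0
        mu1 = (p[t + 1:] * centers[t + 1:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2   # Otsu's between-class variance
        score = (1.0 - p[t]) * between         # valley emphasis weight
        if score > best_score:
            best_score, best_t = score, t

    if best_t is None:
        # Degenerate input (e.g. all values identical): fall back to the mean.
        return float(values.mean())
    # Split at the right edge of bin t, matching the class assignment above.
    return float(edges[best_t + 1])


# Usage: a dense cluster plus a few distant points.
dense = np.linspace(-1.0, 1.0, 200)
sparse = np.linspace(9.8, 10.2, 5)
split = valley_emphasis_split(np.concatenate([dense, sparse]))
print(split)
```

In this toy example the chosen split falls in the empty region between the dense cluster and the distant points, which is the behavior that makes a valley-seeking criterion attractive for isolating rare, well-separated anomalies.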
