RFOD: Random Forest-based Outlier Detection for Tabular Data

Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang

TL;DR

RFOD reframes tabular outlier detection as a feature-wise conditional reconstruction problem by training a dedicated Random Forest for each feature. It combines forest pruning, Adjusted Gower's Distance to score cell-level deviations, and Uncertainty-Weighted Averaging to produce robust row-level anomaly scores with fine-grained interpretability. Across 15 real-world datasets, RFOD delivers strong detection performance, outperforming both data-mining and deep-learning baselines, while achieving favorable efficiency through parallelizable training and pruning. The approach directly handles mixed-type data without lossy encoding and provides actionable explanations at the cell and row levels, making it well-suited for high-stakes domains. The work demonstrates RFOD's scalability and robustness, supported by ablation studies and case analyses that highlight the practical value of interpretable anomaly localization.

Abstract

Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce RFOD, a novel Random Forest-based Outlier Detection framework tailored for tabular data. Rather than modeling a global joint distribution, RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, RFOD combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
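
The feature-wise conditional reconstruction idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes purely numerical features, uses scikit-learn's `RandomForestRegressor`, omits forest pruning, and substitutes a plain range-normalised absolute error and an unweighted mean for AGD and UWA. The function names `fit_feature_forests` and `cell_errors` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_feature_forests(X_train, n_trees=30, seed=0):
    """Train one forest per column, each predicting that column from the others."""
    forests = []
    for j in range(X_train.shape[1]):
        others = np.delete(X_train, j, axis=1)  # leave feature j out of the inputs
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(others, X_train[:, j])
        forests.append(rf)
    return forests


def cell_errors(forests, X_train, X_test):
    """Range-normalised absolute reconstruction error per test cell.

    Stand-in for a cell-level score; this is NOT the paper's AGD.
    """
    col_range = X_train.max(axis=0) - X_train.min(axis=0)
    col_range[col_range == 0] = 1.0  # guard against constant columns
    errors = np.zeros(X_test.shape, dtype=float)
    for j, rf in enumerate(forests):
        pred = rf.predict(np.delete(X_test, j, axis=1))
        errors[:, j] = np.abs(X_test[:, j] - pred) / col_range[j]
    return errors


# Synthetic data: three mutually dependent numerical columns.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X_train = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.1, size=300),
    -a + rng.normal(scale=0.1, size=300),
])
b = rng.normal(size=5)
X_test = np.column_stack([b, 2.0 * b, -b])
X_test[0, 1] += 8.0  # corrupt a single cell in the first test row

forests = fit_feature_forests(X_train)
errors = cell_errors(forests, X_train, X_test)  # cell-level deviations
row_scores = errors.mean(axis=1)                # unweighted mean, not UWA
print("most suspicious row:", int(row_scores.argmax()))
```

Because each column has its own forest, the error matrix keeps one entry per cell, which is what makes cell-level localisation possible; the per-feature loop is also independent across columns, so it could be run in parallel.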

Paper Structure

This paper contains 55 sections, 13 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Average anomaly detection performance of all methods across 15 real-world tabular datasets. RFOD consistently ranks at or near the top, demonstrating strong accuracy and robustness over both data mining and deep learning baselines.
  • Figure 2: Overview of RFOD. Given training data $\bm{X}_{\text{train}} \in \mathbb{R}^{n \times d}$, RFOD applies a leave-one-feature-out strategy, training a dedicated forest $\bm{RF}_j$ for each feature $\bm{x}^j$ using the remaining features. Each forest is pruned via out-of-bag (OOB) validation to enhance generalization. At inference, the pruned forests reconstruct $\bm{X}_{\text{test}}$ to form $\hat{\bm{X}}_{\text{test}}$, which is compared against $\bm{X}_{\text{test}}$ using Adjusted Gower's Distance (AGD) to compute cell-level anomaly scores $\bm{S}_{\text{cell}}$. These are aggregated into row-level scores $\bm{s}_{\text{row}}$ via Uncertainty-Weighted Averaging (UWA) for interpretable and robust detection.
  • Figure 3: Illustration of forest pruning with varying retaining ratio $\beta$.
  • Figure 4: Challenges of existing Gower's Distance (GD) and the proposed AGD for numerical features.
  • Figure 5: Detection accuracy of all evaluated methods across 15 benchmark tabular datasets.
  • ...and 8 more figures
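
The aggregation step named in the Figure 2 caption, where cell-level scores are combined into a row-level score via Uncertainty-Weighted Averaging, might look roughly like the sketch below. The paper's exact UWA formula is not given in this summary, so the weighting here is an assumption: each cell is weighted by `1 / (1 + std)` of the per-tree predictions for that feature, so that cells whose forests disagree contribute less. The function name and the example numbers are hypothetical.

```python
from statistics import pstdev


def uncertainty_weighted_row_score(cell_scores, tree_predictions):
    """Weighted average of one row's cell-level scores.

    cell_scores[j]      : non-negative deviation score for feature j
    tree_predictions[j] : per-tree predictions for feature j
    Assumed weighting (not taken from the paper): w_j = 1 / (1 + std_j).
    """
    weights = [1.0 / (1.0 + pstdev(preds)) for preds in tree_predictions]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, cell_scores)) / total


cell_scores = [0.9, 0.1, 0.8]
tree_predictions = [
    [5.0, 5.1, 4.9],  # trees agree: confident reconstruction
    [2.0, 2.0, 2.1],  # trees agree: confident reconstruction
    [1.0, 9.0, 4.0],  # trees disagree: uncertain reconstruction
]

score = uncertainty_weighted_row_score(cell_scores, tree_predictions)
plain_mean = sum(cell_scores) / len(cell_scores)
print(score, plain_mean)
```

In this toy row the third cell has a high score but an uncertain reconstruction, so the weighted score falls below the plain mean; the intent is that a large deviation counts most when the forest is confident about what the value should have been.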

Theorems & Definitions (6)

  • Definition 3.1: Outlier Detection
  • Example 4.1
  • Definition 4.1: AGD for Numerical Features
  • Example 4.2: AGD for Numerical Features
  • Definition 4.2: AGD for Categorical Features
  • Example 4.3: AGD for Categorical Features