Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection

Yujin Han; Mingwenchan Xu; Leying Guan

TL;DR

The paper introduces CSForest, a conformalized semi-supervised random forest that outputs calibrated set-valued predictions under distributional drift. It combines unlabeled test data with Jackknife+aB conformalization and a target density $\mu(x)=f_{te}(x)+w f_{tr}(x)$ to detect unseen outliers and improve inlier accuracy. A theoretical result guarantees true-label coverage for observed classes under arbitrary label shift (GLS). Empirically, CSForest demonstrates strong outlier detection and robust inlier performance across synthetic and real datasets (e.g., MNIST, FashionMNIST, CIFAR-10) and remains stable as training and test sizes vary. Code is publicly available at https://github.com/yujinhan98/CSForest.

Abstract

The Random Forests classifier, a widely used off-the-shelf classification tool, assumes, like other standard classifiers, that training and test samples come from the same distribution. However, in safety-critical scenarios such as medical diagnosis and network attack detection, discrepancies between the training and test sets, including novel outlier samples that never appear during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest uses unlabeled test samples to enhance accuracy and flags unseen outliers by returning an empty set. Theoretically, we establish that CSForest covers the true labels of previously observed inlier classes under arbitrary label shift in the test data. We compare CSForest with state-of-the-art methods on synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest's effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest performs consistently well as the sizes of the training and test sets vary. Code for CSForest is available at https://github.com/yujinhan98/CSForest.

Paper Structure (28 sections, 2 theorems, 23 equations, 12 figures, 6 tables, 2 algorithms)

INTRODUCTION
RELATED WORK
CONFORMALIZED SEMI-SUPERVISED RANDOM FOREST
EXPERIMENTS
Synthetic Data
Real-World Data
The Outliers w/o Shift Setting
The Shift w/o Outliers Setting
Comparisons with Varying Sample Sizes
DISCUSSION
PROOFS
Proof of Proposition 3.1
Proof of Theorem 3.3
MORE DETAILS ON BASELINES
BCOPS, CRF, DC and ACRFrandom
...and 13 more sections

Key Result

Proposition 3.1

Set the conformal score function as $s(x, k;\mu) = f_k(x)/\mu(x)$. Under the GLS model, the solution to eq. (eq:GLS) is $C(x) = \{k : \mathbb{E}_X[\mathbbm{1}\{s(x,k;\mu)\geq s(X,k;\mu)\} \mid Y=k]\geq \alpha,\ k=1,\ldots,K\}$.
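To make the set in Proposition 3.1 concrete, the following is a minimal sketch that evaluates it by Monte Carlo when all densities are known. Everything numeric here is an assumption for illustration, not taken from the paper: the one-dimensional Gaussian classes, the outlier component at 8, the mixture weights, and the helper names (`normal_pdf`, `score`, `oracle_set`). Only the score $s(x,k;\mu)=f_k(x)/\mu(x)$, the target density $\mu(x)=f_{te}(x)+w f_{tr}(x)$, and the inclusion rule come from the text.

```python
import numpy as np

# Hypothetical toy setup: two inlier classes and one outlier component
# that appears only in the test distribution.
CLASS_MEANS = {1: -2.0, 2: 2.0}
OUTLIER_MEAN = 8.0

def normal_pdf(x, mean, sd=1.0):
    z = (np.asarray(x, dtype=float) - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def f_k(x, k):
    """Class-conditional density of inlier class k."""
    return normal_pdf(x, CLASS_MEANS[k])

def f_tr(x):
    """Training marginal: equal mixture of the inlier classes (assumed)."""
    return 0.5 * f_k(x, 1) + 0.5 * f_k(x, 2)

def f_te(x):
    """Test marginal: inliers plus 20% outliers (assumed weights)."""
    return 0.4 * f_k(x, 1) + 0.4 * f_k(x, 2) + 0.2 * normal_pdf(x, OUTLIER_MEAN)

def score(x, k, w=1.0):
    """Conformal score s(x, k; mu) = f_k(x) / mu(x), mu = f_te + w * f_tr."""
    mu = f_te(x) + w * f_tr(x)
    return f_k(x, k) / mu

def oracle_set(x, alpha=0.05, w=1.0, n_mc=20000, seed=0):
    """Keep class k if P(s(x,k) >= s(X,k) | Y=k) >= alpha (Monte Carlo estimate)."""
    rng = np.random.default_rng(seed)
    kept = []
    for k, mean in CLASS_MEANS.items():
        draws = rng.normal(mean, 1.0, size=n_mc)  # X ~ class k
        p = np.mean(score(x, k, w) >= score(draws, k, w))
        if p >= alpha:
            kept.append(k)
    return kept
```

In this toy setup, a point near a class center is assigned that class alone, while a point near the assumed outlier component receives an empty set, which is how an outlier is flagged.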

Figures (12)

Figure 1: Overview of CSForest. For class $k$, let $\mathcal{I}^b_{k}$, $\mathcal{I}^b_{te}$ and $\tilde{\mathcal{I}}_{other}$ be Bootstrap samples from training class $k$, the test samples, and the training samples other than class $k$, respectively. Using the Bootstrapped samples, we train a multi-class tree classifier with random feature selection, as in the random forest, where we keep the labels of all training samples and treat the test set as its own class. The resulting $B$ random forest tree classifiers, $\{\hat{G}^1(x),..., \hat{G}^B(x)\}$, are used to separate the different labeled classes and the test samples. For the sample pair $x_i\in \mathcal{I}_{te}$ and $x_{i'}\in \mathcal{I}_{k}$, we aggregate the trees that use neither $x_i$ nor $x_{i'}$ (i.e., the trees indexed by $\mathcal{B}_{ii'}=\{b: i\notin \mathcal{I}_{te}^{b}, i'\notin \mathcal{I}^{b}_k\}$) to form an ensemble classifier and, subsequently, an ensemble conformal score function $\hat{s}^{ii'}(x, k;\mu)$. Finally, we compare $\hat{s}^{ii'}(x_i, k; \mu)$ to $\hat{s}^{ii'}(x_{i'}, k; \mu)$ for all $i' \in \mathcal{I}_{k}$ to form the calibrated evaluation $\hat{s}_{ik}$ of test sample $x_i$ belonging to class $k$, and include $k$ in the prediction set $\hat{C}(x_i)$ if $\hat{s}_{ik}$ is no smaller than $\alpha$.
Figure 2: Panel A shows the first two dimensions of samples generated from three classes; green, blue, and red points represent samples from classes 1, 2, and R, respectively. Panel B shows the coverage rate, defined as the proportion of samples whose prediction sets include the true label. The horizontal dashed line marks the 95% coverage level. Panel B is grouped by the actual labels in the test data and colored by whether a prediction set contains only the correct label (blue) or more than the correct label (gray).
Figure 3: Per-class quality evaluation on MNIST. Panels A and B are grouped by the true labels in the test data and colored by whether a prediction set contains only the correct label (blue) or more than the correct label (gray). The horizontal dashed line marks the 95% coverage level.
Figure 4: Type II error for inliers and outliers across different sample sizes on MNIST. CSForest outperforms the baselines, detecting outliers efficiently while maintaining lower inlier type II errors across the sample sizes considered. Note that the error bars are calculated from repeated sample-splitting and can be smaller than the standard deviation owing to sample dependence across runs.
Figure 5: Achieved type II errors for inliers and outliers across 100 repetitions at $\alpha = 0.05$ with only 5 samples per class in the test cohort.
...and 7 more figures
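
The calibration step described in the Figure 1 caption can be sketched as follows. This is a hedged illustration and not the authors' implementation: it assumes the per-tree class-$k$ scores and the bootstrap membership are already available as arrays (`tree_scores`, `in_bag`, both hypothetical names), averages the trees in $\mathcal{B}_{ii'}$ to get $\hat{s}^{ii'}$, and takes $\hat{s}_{ik}$ to be the plain fraction of calibration points that the test point matches or beats. The exact normalization and the handling of pairs with no shared out-of-bag tree are assumptions; the caption does not specify them.

```python
import numpy as np

def calibrated_score(tree_scores, in_bag, i, calib_idx):
    """Sketch of the calibrated evaluation for test sample i and class k.

    tree_scores: (B, n) array, tree b's class-k score at sample j
    in_bag:      (B, n) boolean array, True if sample j is in tree b's bootstrap
    i:           column index of the test sample
    calib_idx:   column indices of the class-k training samples
    """
    wins, used = 0, 0
    for ip in calib_idx:
        # B_{ii'}: trees that used neither the test point nor the calibration point
        oob = ~in_bag[:, i] & ~in_bag[:, ip]
        if not oob.any():
            continue  # assumption: skip pairs without any shared out-of-bag tree
        s_i = tree_scores[oob, i].mean()    # ensemble score at x_i
        s_ip = tree_scores[oob, ip].mean()  # ensemble score at x_{i'}
        wins += int(s_i >= s_ip)
        used += 1
    return wins / used if used else 0.0

def prediction_set(scores_by_class, alpha=0.05):
    """Include class k when its calibrated evaluation is no smaller than alpha."""
    return [k for k, s in scores_by_class.items() if s >= alpha]

# Tiny hand-made example: column 0 is the test point, columns 1-3 are class-k points.
demo_in_bag = np.array([
    [False, False, False, False],
    [True,  False, False, False],
    [False, True,  True,  True],
    [False, False, True,  False],
])
demo_scores = np.tile(np.array([0.5, 0.2, 0.7, 0.5]), (4, 1))
demo_value = calibrated_score(demo_scores, demo_in_bag, 0, [1, 2, 3])
```

Restricting each comparison to trees that saw neither point of the pair is what keeps the test point and the calibration point exchangeable under the ensemble score, which is the idea behind the Jackknife+aB style of calibration named in the abstract.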

Theorems & Definitions (6)

Proposition 3.1
Remark 3.2
Theorem 3.3
Example 1
proof
proof
