Tree-based Ensemble Learning for Out-of-distribution Detection

Zhaiming Shen, Menglun Wang, Guang Cheng, Ming-Jun Lai, Lin Mu, Ruihao Huang, Qi Liu, Hao Zhu

TL;DR

This work tackles out-of-distribution (OOD) detection by proposing TOOD detection, a tree-based ensemble approach that derives a leaf-based embedding for each sample and uses the average pairwise Hamming distance (APHD) to distinguish in-distribution from out-of-distribution data. The method emphasizes interpretability, robustness, efficiency, and adaptability to unsupervised settings, and is supported by theoretical analysis and extensive experiments across tabular, image, and text data. The key contributions include a formal tree-embedding framework, APHD-based OOD scoring with Hoeffding-type concentration guarantees, and strong empirical results against several neural-network-based OOD detectors. The practical impact lies in a simple, generalizable detector that can operate with limited tuning and remains effective under perturbations and across diverse data modalities.

Abstract

Determining whether testing samples follow a distribution similar to that of the training samples is a fundamental question to address before most machine learning models can be safely deployed in practice. In this paper, we propose TOOD detection, a simple yet effective tree-based out-of-distribution (TOOD) detection mechanism that determines whether a set of unseen samples has a distribution similar to that of the training samples. The TOOD detection mechanism is based on computing the pairwise Hamming distance of the testing samples' tree embeddings, which are obtained by fitting a tree-based ensemble model on in-distribution training samples. Our approach is interpretable and robust owing to its tree-based nature. Furthermore, it is efficient, flexible across various machine learning tasks, and easily generalized to the unsupervised setting. Extensive experiments show that the proposed method outperforms other state-of-the-art out-of-distribution detection methods in distinguishing in-distribution from out-of-distribution data on various tabular, image, and text datasets.
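
The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes each sample's tree embedding is stored as a tuple of leaf indices, one per tree, and takes the Hamming distance to be the number of trees in which two samples land in different leaves; the paper's exact embedding and normalization may differ. The function names `hamming` and `aphd` and the toy batches are hypothetical. In practice the leaf indices could come from a fitted ensemble, for example via the `apply` method of a scikit-learn random forest.

```python
from itertools import combinations


def hamming(a, b):
    """Count the trees in which two samples fall into different leaves."""
    if len(a) != len(b):
        raise ValueError("embeddings must cover the same number of trees")
    return sum(x != y for x, y in zip(a, b))


def aphd(embeddings):
    """Average pairwise Hamming distance over a batch of leaf-index embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical leaf indices for an ensemble of 3 trees (illustrative only).
spread_batch = [(0, 1, 2), (3, 1, 0), (0, 4, 2)]        # samples scattered over leaves
concentrated_batch = [(5, 5, 5), (5, 5, 5), (5, 5, 6)]  # samples crowded into few leaves

print(aphd(spread_batch))        # 2.0
print(aphd(concentrated_batch))  # about 0.667
```

A batch whose samples crowd into the same leaves gets a low score, while a batch spread across many leaves gets a high one. How the score is thresholded to flag a batch as out-of-distribution is outside this sketch.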

Paper Structure

This paper contains 33 sections, 5 theorems, 7 equations, 7 figures, 17 tables, 1 algorithm.

Key Result

Lemma 4.1

Let $\mathcal{H}_{\ell}$ be the set of all decision regions obtained from the training samples when growing the $\ell$-th tree. For any ${\bf x}_1, {\bf x}_2\in\mathcal{M}$, we have

Figures (7)

  • Figure 1: Training samples for tree-based ensemble learning
  • Figure 2: Tree-based ensemble learning for in-distribution (left panel) and out-of-distribution (right panel)
  • Figure 3: Boxplot of APHD values when CIFAR-10 is the in-distribution dataset.
  • Figure 4: APHD values for different training sample sizes (Top row: with original labels. Bottom row: with randomly shuffled labels). From left to right: Electricity (in-distribution) vs Others (out-of-distribution); FashionMNIST (in-distribution) vs Others (out-of-distribution); CIFAR-100 (in-distribution) vs Others (out-of-distribution); IMDB (in-distribution) vs Others (out-of-distribution)
  • Figure 5: AUROC and AUPR scores of original images and images under FGSM attack for CIFAR-10 (first and second subplots) and CIFAR-100 (third and fourth subplots) as in-distribution data
  • ...and 2 more figures

Theorems & Definitions (8)

  • Lemma 4.1
  • Theorem 4.1
  • Theorem 4.2
  • Remark 4.1
  • Theorem 4.3
  • Remark 4.2
  • Theorem 4.4
  • Remark 4.3