UniOD: A Universal Model for Outlier Detection across Diverse Domains

Dazhi Fu, Jicong Fan

Abstract

Outlier detection (OD), distinguishing inliers and outliers in completely unlabeled datasets, plays a vital role in science and engineering. Although there have been many insightful OD methods, most of them require troublesome hyperparameter tuning (a challenge in unsupervised learning) and costly model training for every task or dataset. In this work, we propose UniOD, a universal OD framework that leverages labeled datasets to train a single model capable of detecting outliers of datasets with different feature dimensions and heterogeneous feature spaces from diverse domains. Specifically, UniOD extracts uniform and comparable features across different datasets by constructing and factorizing multi-scale point-wise similarity matrices. It then employs graph neural networks to capture comprehensive within-dataset and between-dataset information simultaneously, and formulates outlier detection tasks as node classification tasks. As a result, once the training is complete, UniOD can identify outliers in datasets from diverse domains without any further model/hyperparameter selection and parameter optimization, which greatly improves convenience and accuracy in real applications. More importantly, we provide theoretical guarantees for the effectiveness of UniOD, consistent with our numerical results. We evaluate UniOD on 30 benchmark OD datasets against 17 baselines, demonstrating its effectiveness and superiority. Our code is available at https://github.com/fudazhiaka/UniOD.
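
The feature-extraction step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Gaussian kernels for the point-wise similarity matrices, a median-distance rescaling so that bandwidths are comparable across datasets, and a truncated eigendecomposition as the factorization. The function name `multiscale_features` and the parameters `bandwidths` and `rank` are hypothetical.

```python
import numpy as np


def multiscale_features(X, bandwidths=(0.5, 1.0, 2.0), rank=4):
    """Map an (n, d) dataset to an (n, len(bandwidths) * rank) feature matrix.

    Illustrative assumptions (not taken from the paper): Gaussian kernel,
    median-distance rescaling, truncated eigendecomposition.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, n).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Rescale by the median nonzero distance so one bandwidth grid
    # is meaningful for datasets with very different scales.
    positive = d2[d2 > 0]
    scale = np.median(positive) if positive.size else 1.0

    blocks = []
    for sigma in bandwidths:
        # Similarity matrix at this scale; symmetric positive semidefinite.
        K = np.exp(-d2 / (2.0 * sigma ** 2 * scale))
        # eigh returns eigenvalues in ascending order; keep the largest `rank`.
        w, V = np.linalg.eigh(K)
        top = np.argsort(w)[::-1][:rank]
        # Scale eigenvectors by sqrt(eigenvalue); clip tiny negative values
        # caused by floating-point error.
        blocks.append(V[:, top] * np.sqrt(np.maximum(w[top], 0.0)))
    # Concatenate the per-scale factors into one feature matrix.
    return np.hstack(blocks)


# Two datasets with different feature dimensions yield features of equal width.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
B = rng.normal(size=(80, 17))
print(multiscale_features(A).shape, multiscale_features(B).shape)
```

Because the output width depends only on the number of bandwidths and the chosen rank, not on the original feature dimension, a single downstream classifier can in principle consume datasets of different dimensionality. Eigenvectors are determined only up to sign, so a practical pipeline would need a model that tolerates this ambiguity or an explicit sign convention.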

Paper Structure

This paper contains 43 sections, 11 theorems, 44 equations, 5 figures, 19 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $b_A=\max_{i,k}{\|\mathbf{A}_{H_i,\sigma_k}\|_2}$, $c_X=\max_k\sqrt{\sum_{i}\|\mathbf{X}_{H_i,\sigma_{k}}\|_F^2}$, $b_{W}=\max_{\mathbf{W}\in\mathcal{W}}\|\mathbf{W}\|_2$, $b_{W}'=\max_{\mathbf{W}\in\mathcal{W}}\|\mathbf{W}\|_{2,1}$. Suppose all activation functions are $1$-Lipschitz, and $\ell$ where $C_{\text{GIN}} = b_A^{L} c_X b_W^{LL'}L^{3/2}{L'}^{3/2}(b_W'/b_W)\sqrt{\ln(2\bar{d}^2)}$, $C

Figures (5)

  • Figure 1: Pipeline comparison between UniOD and conventional OD methods. Conventional approaches train a separate model per dataset, while UniOD leverages a collection of historical labeled datasets to train a single universal model.
  • Figure 2: Examples of the performance sensitivity of OD methods to their hyperparameters.
  • Figure 3: Framework of UniOD. UniOD utilizes multiple labeled datasets to train a universal GNN-based classifier that generalizes across data dimensions and domains for OD.
  • Figure 4: Ablation study: (a) Average performance using different numbers of historical datasets; (b) Average performance using different numbers of bandwidths for similarity matrices.
  • Figure 5: t-SNE visualization results of learned representations $\mathbf{Z}_{T_i}$ on several datasets.

Theorems & Definitions (15)

  • Theorem 4.1
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3
  • Lemma A.4
  • Lemma A.5: Lemma A.5 of Bartlett et al. (2017)
  • Lemma A.6
  • proof
  • Lemma B.1
  • Lemma B.2: Covering number bound of GIN
  • ...and 5 more