Consistency-guided semi-supervised outlier detection in heterogeneous data using fuzzy rough sets

Baiyang Chen, Zhong Yuan, Dezhong Peng, Xiaoliang Chen, Hongmei Chen

TL;DR

The paper addresses outlier detection in heterogeneous tabular data under limited supervision by introducing COD, a framework based on fuzzy rough sets (FRS). COD builds label-informed fuzzy similarity relations, evaluates each attribute's contribution via classification consistency, and fuses a per-attribute outlier factor into a single COD score for each instance. Empirical results on 20 public datasets show COD outperforming or matching leading detectors, with particular strength on mixed and categorical data and robustness to the number of selected negative instances. The work advances practical semi-supervised outlier detection in heterogeneous domains and highlights the value of consistency-guided reasoning in FRS-based methods.

Abstract

Outlier detection aims to find samples that behave differently from the majority of the data. Semi-supervised detection methods can exploit the supervision provided by partial labels, thus reducing false positive rates. However, most current semi-supervised methods focus on numerical data and neglect the heterogeneity of data information. In this paper, we propose a semi-supervised consistency-guided outlier detection algorithm (COD) for heterogeneous data, built on fuzzy rough set theory. First, a few labeled outliers are leveraged to construct label-informed fuzzy similarity relations. Next, the consistency of the fuzzy decision system is introduced to evaluate attributes' contributions to knowledge classification. Subsequently, we define the outlier factor based on the fuzzy similarity class and predict outliers by integrating the classification consistency and the outlier factor. The proposed algorithm is extensively evaluated on 15 freshly proposed datasets. Experimental results demonstrate that COD is better than or comparable to the leading outlier detectors. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at https://doi.org/10.1016/j.asoc.2024.112070

Paper Structure

This paper contains 21 sections, 2 theorems, 15 equations, 3 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

For any attribute subset $B, P\subseteq C$, if $B\subseteq P$, then $\widetilde{P} \subseteq \widetilde{B}$.
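
The monotonicity stated in Proposition 1 can be checked numerically. The sketch below is illustrative only: this page does not define $\widetilde{B}$, so the code assumes it denotes the fuzzy similarity relation induced by the attribute subset $B$, formed as the element-wise minimum of per-attribute relations (a common fuzzy rough set construction). The per-attribute similarity, the toy data, and all function names are hypothetical and not taken from the paper.

```python
import numpy as np

def attribute_relation(column):
    """Fuzzy similarity for one numeric attribute (assumed form):
    1 minus the range-normalized absolute difference."""
    col = np.asarray(column, dtype=float)
    span = col.max() - col.min()
    if span == 0:
        # A constant attribute cannot distinguish any pair of samples.
        return np.ones((col.size, col.size))
    return 1.0 - np.abs(col[:, None] - col[None, :]) / span

def subset_relation(data, attrs):
    """Relation induced by an attribute subset (assumption):
    element-wise minimum, i.e. the fuzzy intersection, of the
    single-attribute relations."""
    rels = [attribute_relation(data[:, a]) for a in attrs]
    return np.minimum.reduce(rels)

def is_fuzzy_subset(r1, r2):
    """r1 is contained in r2 when every membership degree in r1
    is at most the corresponding degree in r2."""
    return bool(np.all(r1 <= r2))

# Toy data: 4 samples, 3 numeric attributes (hypothetical values).
data = np.array([[0.1, 5.0, 2.0],
                 [0.4, 3.0, 2.5],
                 [0.9, 4.0, 0.5],
                 [0.3, 1.0, 1.0]])

B = [0]        # smaller attribute subset
P = [0, 2]     # superset of B

rel_B = subset_relation(data, B)
rel_P = subset_relation(data, P)

# With B a subset of P, the relation for P is contained in the one for B.
print(is_fuzzy_subset(rel_P, rel_B))
```

Under this assumed construction the containment holds for any data, because taking the minimum over more attributes can only lower each similarity degree. Intuitively, adding attributes makes samples easier to tell apart, so the induced relation becomes finer.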

Figures (3)

  • Figure 1: Overall framework of COD.
  • Figure 2: AUC-ROC scores w.r.t. multiple levels of supervision on 20 experimental datasets. The best unsupervised method is IForest, which achieves the highest average AUC-ROC score.
  • Figure 3: COD's performances w.r.t. the number of selected negative instances.

Theorems & Definitions (16)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Example 1
  • Proposition 1
  • Definition 7
  • Definition 8
  • ...and 6 more