Label-Informed Outlier Detection Based on Granule Density

Baiyang Chen, Zhong Yuan, Dezhong Peng, Hongmei Chen, Xiaomin Song, Huiming Zheng

TL;DR

The paper tackles outlier detection in heterogeneous data with limited labeled examples. It introduces Granule Density-based Outlier Factor (GDOF), a label-informed framework that uses fuzzy granulation and granule density to model uncertainty and diverse data types. Attribute relevance learned from labels is used to aggregate per-attribute granule densities into per-object outlier scores, enabling robust detection with few labeled outliers. Experiments on 20 real-world datasets show competitive performance across data types and parameter settings, and the code is publicly available.

Abstract

Outlier detection, crucial for identifying unusual patterns with significant implications across numerous applications, has drawn considerable research interest. Existing semi-supervised methods typically treat data as purely numerical and in a deterministic manner, thereby neglecting the heterogeneity and uncertainty inherent in complex, real-world datasets. This paper introduces a label-informed outlier detection method for heterogeneous data based on Granular Computing and Fuzzy Sets, namely Granule Density-based Outlier Factor (GDOF). Specifically, GDOF first employs label-informed fuzzy granulation to effectively represent various data types and develops granule density for precise density estimation. Subsequently, granule densities from individual attributes are integrated for outlier scoring by assessing attribute relevance with a limited number of labeled outliers. Experimental results on various real-world datasets show that GDOF stands out in detecting outliers in heterogeneous data with a minimal number of labeled outliers. The integration of Fuzzy Sets and Granular Computing in GDOF offers a practical framework for outlier detection in complex and diverse data types. All relevant datasets and source code are publicly available for further research. This is the author's accepted manuscript of a paper published in IEEE Transactions on Fuzzy Systems. The final version is available at https://doi.org/10.1109/TFUZZ.2024.3514853
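
The pipeline summarized in the abstract (per-attribute fuzzy granulation, a density per granule, label-derived attribute relevance, weighted aggregation) can be illustrated with a small sketch. This is not the paper's GDOF formulation: the similarity function, the density as mean membership, and the relevance weights are all simplifying assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence, Union

Value = Union[float, str]


def similarity(a: Value, b: Value, span: float) -> float:
    """Assumed fuzzy similarity in [0, 1]: exact match for nominal values,
    range-scaled closeness for numeric values."""
    if isinstance(a, str) or isinstance(b, str):
        return 1.0 if a == b else 0.0
    if span == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / span)


def attribute_density(column: Sequence[Value]) -> List[float]:
    """Assumed granule density: the mean membership of all objects in the
    fuzzy granule induced by each object under one attribute."""
    numeric = [v for v in column if not isinstance(v, str)]
    span = (max(numeric) - min(numeric)) if numeric else 0.0
    n = len(column)
    return [sum(similarity(x, y, span) for y in column) / n for x in column]


def outlier_scores(rows: Sequence[Sequence[Value]],
                   labeled_outliers: Sequence[int]) -> List[float]:
    """Aggregate per-attribute sparsity (1 - density) with weights that favor
    attributes on which the labeled outliers look sparse (an assumption)."""
    if not labeled_outliers:
        raise ValueError("at least one labeled outlier index is required")
    densities = [attribute_density(col) for col in zip(*rows)]

    weights = []
    for dens in densities:
        overall = sum(dens) / len(dens)
        labeled = sum(dens[i] for i in labeled_outliers) / len(labeled_outliers)
        weights.append(max(0.0, overall - labeled))

    total = sum(weights)
    if total == 0:
        # No attribute separates the labeled outliers: fall back to uniform.
        weights = [1.0 / len(weights)] * len(weights)
    else:
        weights = [w / total for w in weights]

    return [sum(w * (1.0 - dens[i]) for w, dens in zip(weights, densities))
            for i in range(len(rows))]


# Mixed numeric/nominal toy data; object 5 is the only labeled outlier.
rows = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.2, "b"),
        (1.0, "b"), (5.0, "a"), (4.8, "b")]
scores = outlier_scores(rows, labeled_outliers=[5])
```

On this toy data the nominal attribute receives zero weight because the labeled outlier is not sparse on it, so the unlabeled object 6, which is close to the labeled outlier on the numeric attribute, also ranks above the remaining objects.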

Paper Structure

This paper contains 23 sections, 3 theorems, 21 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Given an information system $(X, A)$, for any attribute subset $B,C\subseteq A$, if $C\subseteq B$, then $\widetilde{B} \subseteq \widetilde{C}$.
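
A small numerical check can make the proposition concrete. The sketch below assumes, since this page does not give the definitions, that $\widetilde{B}$ is the fuzzy relation induced by $B$ as the pointwise minimum of per-attribute fuzzy similarity relations, and that $\subseteq$ means pointwise $\le$. Under that reading, adding attributes can only lower or preserve each membership degree. The similarity function and all names are hypothetical.

```python
from typing import Dict, List

Relation = List[List[float]]


def attribute_relation(values: List[float]) -> Relation:
    """Assumed per-attribute fuzzy similarity: 1 - |x - y| / range."""
    span = max(values) - min(values)
    if span == 0:
        return [[1.0] * len(values) for _ in values]
    return [[1.0 - abs(x - y) / span for y in values] for x in values]


def subset_relation(data: Dict[str, List[float]], attrs: List[str]) -> Relation:
    """Relation induced by a non-empty attribute subset: pointwise minimum
    (fuzzy intersection) of the per-attribute relations."""
    rels = [attribute_relation(data[a]) for a in attrs]
    n = len(rels[0])
    return [[min(r[i][j] for r in rels) for j in range(n)] for i in range(n)]


def is_contained(r: Relation, s: Relation) -> bool:
    """Fuzzy containment r ⊆ s, read as r[i][j] <= s[i][j] for all pairs."""
    return all(a <= b for row_r, row_s in zip(r, s) for a, b in zip(row_r, row_s))


data = {
    "a1": [0.0, 1.0, 2.0],
    "a2": [0.0, 2.0, 1.0],
    "a3": [0.0, 0.0, 1.0],
}
rel_B = subset_relation(data, ["a1", "a2", "a3"])  # B = {a1, a2, a3}
rel_C = subset_relation(data, ["a1", "a2"])        # C = {a1, a2}, so C ⊆ B

print(is_contained(rel_B, rel_C))  # relation of the larger subset is finer
print(is_contained(rel_C, rel_B))  # the converse need not hold
```

Taking the minimum over a larger set of attributes can never increase a value, which is exactly why the containment holds in this reading.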

Figures (2)

  • Figure 1: AUC scores across different numbers of labeled outliers. The horizontal axis indicates the number of labeled outliers, ranging from 5 to 30.
  • Figure 2: GDOF's performances across various numbers of normal objects. The horizontal axis indicates the number of normal objects, ranging from 50 to 500.

Theorems & Definitions (15)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 1
  • Proof 1
  • Definition 5
  • Definition 6
  • Proposition 2
  • Proof 2
  • ...and 5 more