Detecting outliers by clustering algorithms
Qi Li, Shuliang Wang
TL;DR
ODAR reframes outlier detection as a clustering problem: a feature transformation maps data into ODAR space, defined by local density $\rho$ and high-order density $h\rho$, where outliers and normal objects form two clearly separable clusters. A shrinking step and a component clustering strategy make the approach usable with diverse clustering algorithms. Across ten datasets, ODAR-enhanced clustering achieves an average accuracy around $0.84$, improves on leading baselines for several tasks, and remains robust to changes in distribution, outlier count, and density imbalance while maintaining practical runtimes. The result is a scalable, versatile method with low parameter sensitivity that broadens the applicability of outlier detection in clustering workflows.
Abstract
Clustering and outlier detection are two important tasks in data mining. Outliers frequently interfere with how clustering algorithms determine the similarity between objects, resulting in unreliable clustering results. Currently, only a few clustering algorithms (e.g., DBSCAN) can detect outliers and thereby eliminate this interference. For other clustering algorithms, it is tedious to run a separate outlier detection task to remove outliers before each clustering process. Equipping more clustering algorithms with outlier detection ability is therefore valuable. Although a common strategy lets clustering algorithms detect outliers based on the distance between objects and clusters, it conflicts with the goal of improving clustering performance on datasets that contain outliers. In this paper, we propose a novel outlier detection approach for clustering, called ODAR. ODAR maps outliers and normal objects into two separated clusters by feature transformation. As a result, any clustering algorithm can detect outliers by identifying clusters. Experiments show that ODAR is robust to diverse datasets. Compared with baseline methods, clustering algorithms equipped with ODAR achieve the best result on 7 out of 10 datasets, with at least 5% improvement in accuracy.
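
The idea of detecting outliers by clustering in a density-based feature space can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the definitions of $\rho$ and $h\rho$, so the code assumes $\rho$ is the inverse of the mean distance to the $k$ nearest neighbours and $h\rho$ is the mean $\rho$ of those neighbours. The function names (`to_density_space`, `two_means`, `detect_outliers`) are hypothetical, a plain 2-means stands in for "any clustering algorithm", and the shrinking and component clustering steps are omitted.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbours: for each point, the indices of its k closest points."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append([j for _, j in dists[:k]])
    return result


def to_density_space(points, k=5):
    """Map each point to a 2-D feature (rho, h_rho).

    ASSUMPTION (not taken from the paper):
      rho   = 1 / mean distance to the k nearest neighbours
      h_rho = mean rho of those k neighbours
    """
    nbrs = knn_indices(points, k)
    rho = []
    for i, idx in enumerate(nbrs):
        mean_d = sum(math.dist(points[i], points[j]) for j in idx) / k
        rho.append(1.0 / (mean_d + 1e-12))  # small constant avoids division by zero
    h_rho = [sum(rho[j] for j in idx) / k for idx in nbrs]
    return list(zip(rho, h_rho))


def two_means(features, iters=50):
    """Plain 2-means on the transformed features (stand-in for any clustering algorithm)."""
    # Deterministic initialisation: the lowest-density and highest-density points.
    centers = [min(features), max(features)]
    labels = [0] * len(features)
    for _ in range(iters):
        labels = [
            0 if math.dist(f, centers[0]) <= math.dist(f, centers[1]) else 1
            for f in features
        ]
        for g in (0, 1):
            members = [f for f, lab in zip(features, labels) if lab == g]
            if members:
                centers[g] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def detect_outliers(points, k=5):
    """Return one boolean per point: True if it falls in the low-density cluster."""
    features = to_density_space(points, k)
    labels, centers = two_means(features)
    low = 0 if centers[0][0] < centers[1][0] else 1  # cluster with the smaller mean rho
    return [lab == low for lab in labels]


if __name__ == "__main__":
    # Toy data: a regular 6x6 grid of normal objects plus three distant points.
    grid = [(float(x), float(y)) for x in range(6) for y in range(6)]
    far = [(20.0, 20.0), (-15.0, 3.0), (3.0, -18.0)]
    flags = detect_outliers(grid + far, k=4)
    print([p for p, is_out in zip(grid + far, flags) if is_out])
```

On this toy data the three distant points have a much lower $\rho$ than any grid point, so the two clusters in the transformed space separate them cleanly. Real datasets with uneven densities are where the omitted steps of the method would matter.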
