Automatic Parameter Selection for Non-Redundant Clustering

Collin Leiber, Dominik Mautz, Claudia Plant, Christian Böhm

TL;DR

This paper addresses the problem of automatically discovering multiple non-redundant clusterings in high-dimensional data. It introduces an MDL-based framework that jointly infers the number of subspaces $J$ and the number of clusters $k_j$ in each subspace. A greedy search over the parameter space splits and merges subspaces and clusters, and an integrated outlier detection mechanism accounts for outliers directly in the MDL objective. The AutoNR algorithm implements this framework by combining the MDL encoding with Nr-Kmeans, using a tied-variance multivariate normal (MVN) model in each subspace to compute encoding costs and derive outlier thresholds. Empirical results on synthetic and real data show performance that is competitive with or superior to state-of-the-art methods, with stable results and further improvements when outliers are detected. The approach offers a scalable, parameter-free way to discover multiple interpretable subspace clusterings in high-dimensional spaces, with potential extensions to multi-view settings.

Abstract

High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered by color, weight, or size, revealing different interpretations of the given dataset. A variety of approaches are able to identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and clusters for each subspace. Stating these values is a non-trivial problem and usually requires detailed knowledge of the input dataset. In this paper, we propose a framework that utilizes the Minimum Description Length Principle (MDL) to detect the number of subspaces and clusters per subspace automatically. We describe an efficient procedure that greedily searches the parameter space by splitting and merging subspaces and clusters within subspaces. Additionally, an encoding strategy is introduced that allows us to detect outliers in each subspace. Extensive experiments show that our approach is highly competitive with state-of-the-art methods.
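
The idea of choosing the number of clusters by minimizing a description length can be illustrated in a much simpler setting than the one the paper addresses. The sketch below is not AutoNR: it works in a single space, does not search over subspaces, and ignores outliers. It assumes a spherical Gaussian per cluster with one shared variance and a BIC-style cost of half of log2(n) bits per parameter. All function names (`kmeans`, `description_length`, `select_k`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point that is farthest from its nearest current center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster runs empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels


def description_length(points, centers, labels):
    """Approximate two-part code length in bits (an assumed, simplified encoding)."""
    n, d, k = len(points), len(points[0]), len(centers)
    sse = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    var = max(sse / (n * d), 1e-12)  # one variance shared by all clusters
    # Data given the clustering: Gaussian negative log-likelihood at the ML variance.
    data_bits = 0.5 * n * d * math.log2(2 * math.pi * math.e * var)
    # Cluster labels: entropy code based on the cluster sizes.
    sizes = [labels.count(j) for j in range(k)]
    label_bits = -sum(s * math.log2(s / n) for s in sizes if s > 0)
    # Parameters: k*d center coordinates, k-1 weights, 1 variance.
    param_bits = 0.5 * (k * d + (k - 1) + 1) * math.log2(n)
    return data_bits + label_bits + param_bits


def select_k(points, k_max=6):
    """Try k = 1..k_max and return the k with the smallest code length."""
    costs = {}
    for k in range(1, k_max + 1):
        centers, labels = kmeans(points, k)
        costs[k] = description_length(points, centers, labels)
    return min(costs, key=costs.get), costs


# Toy data: three well-separated 2-D blobs with 50 points each.
rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blobs for _ in range(50)]
best_k, costs = select_k(data)
print(best_k)
```

Too few clusters inflate the shared variance and with it the data cost. Too many clusters make the labels and parameters more expensive than the variance reduction saves. On well-separated toy data like this, the minimum should fall at the true number of blobs. An exhaustive scan over k is used here for brevity, whereas the abstract describes a greedy split-and-merge search.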

Paper Structure

This paper contains 15 sections, 29 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Images of the letters 'A', 'B', 'C', 'X', 'Y' and 'Z' in the colors pink, cyan, and yellow. In each image, one corner is highlighted in color. This results in three different clusterings with 6, 3, and 4 clusters.
  • Figure 2: Example execution of AutoNR on a sample dataset ($d=11$). The arrows indicate which subspaces are affected by an operation. After two noise space splits, a cluster space split and a cluster space merge, three cluster spaces are identified with four, three, and two clusters.
  • Figure 3: Clustering results of AutoNR with and without outlier detection on the second subspace of syn3o.
  • Figure 4: The cluster centers in the three subspaces of NRLetters as identified by AutoNR.