Automatic Parameter Selection for Non-Redundant Clustering
Collin Leiber, Dominik Mautz, Claudia Plant, Christian Böhm
TL;DR
This paper tackles the problem of automatically discovering multiple non-redundant clusterings in high-dimensional data. It introduces an MDL-based framework that jointly infers the number of subspaces $J$ and the number of clusters $k_j$ in each subspace. A greedy search over the parameter space splits and merges subspaces and clusters, and an integrated outlier detection mechanism penalizes outliers directly in the MDL objective. The AutoNR algorithm instantiates this framework by combining the MDL encoding with Nr-Kmeans, using a tied-variance multivariate normal (MVN) model in each subspace to compute encoding costs and derive outlier thresholds. Experiments on synthetic and real data show performance that is competitive with or superior to state-of-the-art methods, with stable results and further improvements when outliers are detected. The approach offers a scalable, parameter-free way to discover multiple interpretable subspace clusterings in high-dimensional data, with potential extensions to multi-view settings.
Abstract
High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered by color, weight, or size, each revealing a different interpretation of the given dataset. A variety of approaches are able to identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and the number of clusters in each subspace. Choosing these values is a non-trivial problem and usually requires detailed knowledge of the input dataset. In this paper, we propose a framework that utilizes the Minimum Description Length (MDL) principle to detect the number of subspaces and clusters per subspace automatically. We describe an efficient procedure that greedily searches the parameter space by splitting and merging subspaces and clusters within subspaces. Additionally, an encoding strategy is introduced that allows us to detect outliers in each subspace. Extensive experiments show that our approach is highly competitive with state-of-the-art methods.
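To give a rough intuition for how an MDL criterion can select the number of clusters, the following minimal sketch scores plain k-means results in a single subspace with a simplified two-part code and keeps the cheapest $k$. It is an illustration under stated assumptions, not the encoding or search procedure proposed in this paper: the function names (`farthest_first_kmeans`, `mdl_cost`, `select_k`) are hypothetical, ordinary k-means stands in for Nr-Kmeans, the shared variance is assumed to be spherical, the parameter cost is a BIC-style approximation, and neither multiple subspaces nor outliers are handled.

```python
import numpy as np


def assign(X, centers):
    """Index of the closest center for every row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def farthest_first_kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-first start."""
    centers = [X[0]]
    for _ in range(k - 1):
        # Squared distance of every point to its closest chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return assign(X, centers), centers


def mdl_cost(X, labels, centers):
    """Simplified two-part code length in bits (assumption, not the paper's)."""
    n, d = X.shape
    k = len(centers)
    sse = ((X - centers[labels]) ** 2).sum()
    sigma2 = max(sse / (n * d), 1e-12)  # one spherical variance tied across clusters
    # Negative log2-likelihood of the data at the maximum-likelihood variance.
    data_bits = 0.5 * n * d * np.log2(2 * np.pi * np.e * sigma2)
    # Uniform code for each point's cluster label.
    label_bits = n * np.log2(k)
    # BIC-style cost for k centers plus one variance.
    param_bits = 0.5 * (k * d + 1) * np.log2(n)
    return data_bits + label_bits + param_bits


def select_k(X, k_max=6):
    """Return the k in 1..k_max with the smallest cost, plus all costs."""
    costs = {}
    for k in range(1, k_max + 1):
        labels, centers = farthest_first_kmeans(X, k)
        costs[k] = mdl_cost(X, labels, centers)
    return min(costs, key=costs.get), costs
```

Calling `select_k(X)` on an `(n, d)` array returns the chosen `k` together with the cost of every candidate. Because the data term is based on a density, the totals can be negative and are meaningful only relative to each other. A greedy split-and-merge search as described in the abstract would replace the exhaustive loop over `k` in this sketch.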
