Cluster Metric Sensitivity to Irrelevant Features

Miles McCrory, Spencer A. Thomas

TL;DR

The paper investigates how adding irrelevant, uncorrelated features affects clustering results in unsupervised learning, focusing on how different metrics react to Gaussian versus Uniform noise and data scaling. Using Dimsets datasets with ground-truth labels and a controlled k-means setup, it evaluates ARI, NMI, RI, Silhouette Coefficient, and Davies-Bouldin under varied feature ratios. The results show that Silhouette and Davies-Bouldin are highly sensitive to added features and thus good candidates for driving unsupervised feature selection, while ARI and NMI are more robust to Gaussian noise but can exhibit tipping points under Uniform noise; standardization helps stabilize metrics and erase tipping points. These findings inform metric selection and preprocessing choices for high-dimensional, noisy unsupervised clustering tasks.

Abstract

Clustering algorithms are used extensively in data analysis for data exploration and discovery. Technological advancements lead to continual growth of data in terms of volume, dimensionality and complexity. This provides great opportunities in data analytics, as the data can be interrogated for many different purposes. This, however, leads to challenges, such as the identification of relevant features for a given task. In supervised tasks, one can utilise a number of methods to optimise the input features for the task objective (e.g. classification accuracy). In unsupervised problems, such tools are not readily available, in part due to an inability to quantify feature relevance in unlabelled tasks. In this paper, we investigate the sensitivity of clustering performance to noisy, uncorrelated variables iteratively added to baseline datasets with well-defined clusters. We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways. We observe a resilience to very high proportions of irrelevant features for the adjusted Rand index (ARI) and normalised mutual information (NMI) when the irrelevant features are Gaussian distributed. For Uniformly distributed irrelevant features, we notice that the resilience of ARI and NMI depends on the dimensionality of the data and exhibits tipping points between high scores and near zero. Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to irrelevant added features, exhibiting large changes in score for comparatively low proportions of irrelevant features regardless of underlying distribution or data scaling. As such, the Silhouette Coefficient and the Davies-Bouldin score are good candidates for optimising feature selection in unsupervised clustering tasks.
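The experiment described above can be sketched in a few lines with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code: the baseline data come from `make_blobs` with hand-picked centres (a stand-in for the benchmark datasets used in the paper), the noise parameters (standard normal, Uniform on [0, 1]) are guesses, a single run replaces the averaging over repeats, and the helper names `add_irrelevant_features` and `score_clustering` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    rand_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def add_irrelevant_features(X, ratio, kind, rng):
    """Append round(ratio * n_features) noise columns that carry no cluster signal."""
    n_samples, n_features = X.shape
    n_noise = int(round(ratio * n_features))
    if kind == "gaussian":
        # Assumed noise scale: standard normal.
        noise = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_noise))
    elif kind == "uniform":
        # Assumed noise range: [0, 1).
        noise = rng.uniform(low=0.0, high=1.0, size=(n_samples, n_noise))
    else:
        raise ValueError(f"unknown noise kind: {kind!r}")
    return np.hstack([X, noise])


def score_clustering(X, y_true, n_clusters, standardize, seed):
    """Run k-means once and return external and internal cluster metrics."""
    if standardize:
        X = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return {
        # External metrics: compare predicted labels with the ground truth.
        "ARI": adjusted_rand_score(y_true, labels),
        "NMI": normalized_mutual_info_score(y_true, labels),
        "RI": rand_score(y_true, labels),
        # Internal metrics: computed from the (augmented) features alone.
        "Silhouette": silhouette_score(X, labels),
        "Davies-Bouldin": davies_bouldin_score(X, labels),  # lower is better
    }


# Stand-in baseline: three well-separated clusters in four informative dimensions.
centers = np.array([[0, 0, 0, 0], [6, 6, 6, 6], [-6, 6, -6, 6]], dtype=float)
X_base, y_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

rng = np.random.default_rng(0)
results = {}
for kind in ("gaussian", "uniform"):
    for ratio in (0, 1, 2, 3):  # irrelevant : informative feature ratio
        X_aug = add_irrelevant_features(X_base, ratio, kind, rng)
        results[(kind, ratio)] = score_clustering(
            X_aug, y_true, n_clusters=3, standardize=True, seed=0
        )

for (kind, ratio), scores in results.items():
    print(kind, f"{ratio}:1", {name: round(value, 3) for name, value in scores.items()})
```

Setting `standardize=False` in the loop gives the unscaled variant of the same comparison. Note that ARI, NMI and RI require ground-truth labels, whereas the Silhouette Coefficient and the Davies-Bouldin score do not, which is why only the latter two could drive feature selection in a genuinely unlabelled setting.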

Paper Structure

This paper contains 7 sections, 9 equations, and 3 figures.

Figures (3)

  • Figure 1: Workflow for the experimentation used in this work.
  • Figure 2: A comparison of clustering performance with different random number generation and data scaling methods. Columns depict different scaling and random number generation methods, and rows show clustering performance metrics averaged over 50 independent runs. The ratio of random variables to informative features ranges from 0:1 (baseline model) to 3:1, where 75% of the input data is randomly generated and therefore does not correlate with the cluster label. Note that, unlike the other metrics, higher Davies-Bouldin scores indicate worse clustering performance.
  • Figure 3: Standard deviation ($\sigma$) of the clustering metric scores over 50 independent repeats. As in Fig. 2, these values are plotted as a function of the ratio of random variables to informative features. Rows and columns are as in Fig. 2.
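
The captions above rely on two small pieces of arithmetic: converting a ratio of random to informative features into the fraction of random input, and summarising a metric over independent repeats by its mean and standard deviation. A minimal sketch follows, assuming a population standard deviation (the captions do not say which estimator is used); the function names `irrelevant_fraction` and `summarise_repeats` are hypothetical.

```python
from statistics import mean, pstdev


def irrelevant_fraction(ratio):
    """Fraction of input features that are random, for a ratio of `ratio`:1."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    return ratio / (ratio + 1.0)


def summarise_repeats(scores):
    """Mean and population standard deviation of one metric over repeated runs."""
    return mean(scores), pstdev(scores)


print(irrelevant_fraction(0))  # 0.0, the baseline with no random features
print(irrelevant_fraction(3))  # 0.75, the 3:1 case quoted in the Figure 2 caption
```

Here `summarise_repeats` would be called once per metric and per ratio on the list of scores collected from the repeated runs.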