Clustering by Nonparametric Smoothing

David P. Hofmeyr

TL;DR

The paper reframes clustering as estimating a continuous membership function $f^*:\mathcal{X}\to \Pi_K$ via nonparametric smoothing, avoiding explicit parametric cluster models. It introduces a stabilised, absorbing-state Markov-chain formulation that yields a closed-form limit $\lim_{t\to\infty} \hat{f}^*_{t}(\mathbf{x}_i) = \lambda \left(\mathbf{I}-(1-\lambda)\mathbf{W}\right)^{-1}_{i:}\hat{\mathbf{F}}^*_0$, with data-driven selection of $\lambda$, $k$, and $K$ through the criterion $C(\lambda,k,K)/R(\lambda,k)$. The author reports experiments on 45 public datasets, comparing CNS against a broad suite of baselines and showing competitive performance, aided by a scalable, sparse weighting scheme and a practical initialization strategy. An R implementation is released, so the method can be applied directly to flexible clustering without strong parametric assumptions.
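
To make the closed-form limit concrete, the following is a minimal sketch in plain Python, not the released R implementation. It assumes that $\mathbf{W}$ is a row-stochastic weight matrix, that $\hat{\mathbf{F}}^*_0$ holds initial memberships whose rows sum to one, and that the limit arises from the recursion $\mathbf{F}_{t+1} = (1-\lambda)\mathbf{W}\mathbf{F}_t + \lambda\hat{\mathbf{F}}^*_0$ (one recursion consistent with the stated limit; the paper's exact construction may differ). The toy matrices and the helper names `solve`, `closed_form_limit`, and `iterate` are hypothetical.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]  # augmented matrix [A | B]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def closed_form_limit(W, F0, lam):
    """Return lam * (I - (1 - lam) W)^{-1} F0 without forming the inverse."""
    n = len(W)
    A = [[(1.0 if i == j else 0.0) - (1.0 - lam) * W[i][j] for j in range(n)]
         for i in range(n)]
    return [[lam * v for v in row] for row in solve(A, F0)]


def iterate(W, F0, lam, steps=500):
    """Assumed recursion F <- (1 - lam) W F + lam F0, run for a fixed number of steps."""
    F = [row[:] for row in F0]
    for _ in range(steps):
        WF = matmul(W, F)
        F = [[(1.0 - lam) * wf + lam * f0 for wf, f0 in zip(r_wf, r_f0)]
             for r_wf, r_f0 in zip(WF, F0)]
    return F


# Hypothetical toy problem: 4 points, K = 2 clusters.
W = [[0.0, 0.8, 0.2, 0.0],   # each row sums to one (row-stochastic)
     [0.8, 0.0, 0.2, 0.0],
     [0.1, 0.1, 0.0, 0.8],
     [0.0, 0.0, 1.0, 0.0]]
F0 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # initial memberships
lam = 0.3

F_closed = closed_form_limit(W, F0, lam)
F_iter = iterate(W, F0, lam)
labels = [max(range(len(row)), key=row.__getitem__) for row in F_closed]  # hard assignment
print(F_closed)
print(labels)
```

Under these assumptions each row of the result stays on the probability simplex, because $\lambda\left(\mathbf{I}-(1-\lambda)\mathbf{W}\right)^{-1}\mathbf{1} = \mathbf{1}$ whenever $\mathbf{W}$ is row-stochastic. The sketch solves a linear system rather than forming the inverse explicitly; with a sparse $\mathbf{W}$ and many points, a sparse solver would be the natural replacement for the dense elimination used here.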

Abstract

A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/CNS

Paper Structure

This paper contains 11 sections, 15 equations, 8 tables.