visClust: A visual clustering algorithm based on orthogonal projections

Anna Breger, Clemens Karner, Martin Ehler

TL;DR

visClust introduces a fast, parameter-light clustering method that leverages random orthogonal projections from the Grassmannian to produce low-dimensional representations, which are encoded as binary images and partitioned via simple image-processing steps. By iteratively sampling projections, filtering, thresholding, and analyzing connected components, it selects a partition that matches the target cluster count $n_c$ while requiring only one obligatory input parameter in the default setting. Across synthetic and publicly available datasets, visClust demonstrates strong ACC and ARI performance with favorable runtime and RAM consumption, often outperforming six well-known baselines; it remains robust under default settings and can benefit from parameter tuning or nonlinear projections for imaging data. The approach provides a practical, scalable clustering tool with publicly available code and clear potential for extensions to handle higher-dimensional or imaging-specific tasks through nonlinear embeddings like t-SNE.
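The pipeline summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the grid size, the Gaussian filter, and the values of `sigma`, `threshold` and `max_iter` are all assumptions chosen for illustration, and the projection is sampled via a QR decomposition of a Gaussian matrix, which is one common way to draw a random matrix with orthonormal rows.

```python
import numpy as np
from scipy import ndimage


def random_orthogonal_projection(d, k, rng):
    """Sample a k x d matrix with orthonormal rows (QR of a Gaussian matrix)."""
    gaussian = rng.standard_normal((d, k))
    q, r = np.linalg.qr(gaussian)      # q is d x k with orthonormal columns
    q = q * np.sign(np.diag(r))        # usual sign correction for a uniform (Haar) sample
    return q.T


def points_to_binary_image(points_2d, size):
    """Rasterize 2-D points on a size x size grid; occupied pixels are set to 1."""
    lo = points_2d.min(axis=0)
    # Same scale on both axes (an assumption) so that shapes are not stretched.
    span = max(float((points_2d.max(axis=0) - lo).max()), 1e-12)
    idx = np.floor((points_2d - lo) / span * (size - 1)).astype(int)
    image = np.zeros((size, size), dtype=np.uint8)
    image[idx[:, 0], idx[:, 1]] = 1
    return image, idx


def cluster_by_projection(x, n_c, size=64, sigma=1.5, threshold=0.05,
                          max_iter=200, seed=0):
    """Try random projections until the image has exactly n_c connected components.

    Returns one label per data point (-1 if its pixel fell below the threshold),
    or None if no sampled projection produced n_c components.
    """
    rng = np.random.default_rng(seed)
    for _ in range(max_iter):
        q = random_orthogonal_projection(x.shape[1], 2, rng)
        image, idx = points_to_binary_image(x @ q.T, size)
        smoothed = ndimage.gaussian_filter(image.astype(float), sigma=sigma)  # filtering
        mask = smoothed > threshold                                           # thresholding
        component_image, n_found = ndimage.label(mask)                        # connected components
        if n_found == n_c:
            # Each point inherits the component label of its pixel.
            return component_image[idx[:, 0], idx[:, 1]] - 1
    return None
```

In this sketch the only obligatory arguments are the data `x` and the target cluster count `n_c`; everything else has a default. A projection is simply rejected when the component count does not match, so the quality of the result depends on how many projections are sampled.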

Abstract

We present a novel clustering algorithm, visClust, that is based on lower-dimensional data representations and visual interpretation. To this end, we design a transformation that represents the data as a binary integer array, which enables the use of image processing methods to select a partition. Qualitative and quantitative analyses, measured in accuracy and an adjusted Rand index, show that the algorithm performs well while requiring low runtime and RAM. We compare the results to 6 state-of-the-art algorithms with available code, confirming the quality of visClust by superior performance in most experiments. Moreover, the algorithm requires just one obligatory input parameter while allowing optimization via optional parameters. The code is available on GitHub and is straightforward to use.
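The two quality measures named in the abstract can be sketched as follows. This is an illustrative, standard-library-only version: the ARI below is the common Hubert-Arabie form and the accuracy is the best label agreement over all one-to-one relabelings (found by brute force, so it is only practical for a small number of clusters). Whether the paper uses exactly these definitions is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index (Hubert-Arabie form); assumes at least 2 samples."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:          # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def clustering_accuracy(truth, pred):
    """Fraction of correctly labeled points under the best one-to-one relabeling."""
    true_ids = sorted(set(truth))
    pred_ids = sorted(set(pred))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(truth, pred))
        best = max(best, hits)
    return best / len(truth)
```

Both measures are invariant to how the clusters are numbered, which is why a permuted but otherwise identical partition scores 1.0 on each.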

Paper Structure (37 sections, 1 theorem, 17 equations, 8 figures, 8 tables, 1 algorithm)

This paper contains 37 sections, 1 theorem, 17 equations, 8 figures, 8 tables, 1 algorithm.

Key Result

Theorem 2.1

The expectation of the covering radius $\rho$ of $n$ random points $\{p_j\}_{j=1}^n$, i.i.d. with respect to $\mu_{k,d}$ satisfies

Figures (8)

  • Figure 1: (a) Original data $x \subset \mathbb{R}^3$, (b) projected data $qx \subset \mathbb{R}^2$ with a chosen $q \in \mathcal{V}_{2,3}$ (see Section \ref{alg}), (c) projected data using the projection provided by PCA. The synthetic dataset $x$ was generated in Python 3 using the function make_classification from the package sklearn. It consists of $3$ clusters with a total of $1000$ data points. The non-overlapping clusters have a skewed normal distribution with standard deviation $1$ within a $3$-dimensional hypercube of side length $4$. The unlabeled data are shown on the left; the right plot shows the same data with added ground truth information.
  • Figure 2: Substeps of the algorithm described in detail in Section \ref{substeps}.
  • Figure 3: Visual comparison of the results from the chosen clustering algorithms on $6$ synthetic data sets in $\mathbb{R}^2$. The runtime until termination is stated in seconds. Each data set has dimension $d=2$ and consists of $m=1500$ data points. The data sets are generated in Python 3 with the following functions: make_circles (1), make_moons (2) and make_blobs (3-5). The last data set consists of a single Gaussian cluster generated with the function random.normal. The data points are randomly distributed along the different shapes with Gaussian distributions of varying standard deviations. In the last row the data consist of one big cluster, but the methods were asked to return $3$ clusters. Note that AdaGAE had to be computed on a different computer, so the runtime stated here may not be directly comparable; see Section \ref{numexp}.
  • Figure 4: RAM (in MB) needed for the runtime experiments with a varying number of data points shown in Figure \ref{ariruntime}. The experiments use synthetic data where the number of data points $m$ varies, $d = 5$ and $n_c = 4$. The AdaGAE algorithm was computed with GPU acceleration; we therefore report the needed RAM as well as video RAM (VRAM).
  • Figure 5: ARI (mean) and runtime (mean) of 100 independent runs for varying number of data points $m$ (top), dimensions $d$ (middle) and clusters $n_c$ (bottom) for the data set described in Section \ref{datarun}. SpectACl requires at least 2 clusters. Default: $m = 1000$, $d = 5$ and $n_c = 4$.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 2.1: Reznikov and Saff 10.1093/imrn/rnv342
  • Remark 2.2
  • Remark 2.3
  • Remark 4.1