Generating Multidimensional Clusters With Support Lines

Nuno Fachada, Diogo de Andrade

TL;DR

Clugen is a modular procedure for synthetic data generation that creates multidimensional clusters supported by line segments using arbitrary distributions. It has the potential to be a widely used framework in diverse clustering-related research tasks.

Abstract

Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at a premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.

Paper Structure (23 sections, 12 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Flowchart of the Clugen algorithm. Functions in light blue are stochastic and can be swapped by the user through the optional parameters (Table \ref{tab:paramsopt}). The "norm" and "n-1" strings are aliases for the functionality described in Sections \ref{sec:methods:alg:projs} and \ref{sec:methods:alg:points}, respectively. A stylized 2D example of the algorithm's steps is shown in the respective side images. The example was generated with Julia's Clugen implementation and mandatory parameters (Table \ref{tab:paramsmand}) set to $n=2$, $c=4$, $p=200$, $\mathbf{d}=(1,1)$, $\theta_\sigma=\pi/16$, $\mathbf{s}=(10,10)$, $l=10$, $l_\sigma=1.5$, and $f_\sigma=1$. Optional parameters were left at their defaults.
  • Figure 2: Possible cluster sizes with $c=4$ and $p=5000$, and $c_s()$ set as follows ((b)--(d) are custom user functions): (a) normal distribution (discretized) with total points correction (the default, implemented by the clusizes() function); (b) discrete uniform distribution with total points correction; (c) Poisson distribution with total points correction; (d) Poisson distribution, no correction.
  • Figure 3: The output of the Clugen algorithm for two different $c_c()$-referenced functions for finding cluster centers: a) the default, using the uniform distribution; b) hand-picked centers. Remaining parameters are the same as in Fig. \ref{fig:flow}, except for $p$, which is set to 5000.
  • Figure 4: Line lengths for different definitions of $l()$: a) the default, using the folded normal distribution; b) using the Poisson distribution, with $\lambda=l$; c) using the uniform distribution in the interval $\left[0, 2l\right[$; and, d) hand-picked lengths, more specifically $\pmb{\ell}=(2, 8, 16, 32)$. Cluster centers, as well as parameters $l$ and $l_\sigma$, are the same as for the example shown in Fig. \ref{fig:flow}.
  • Figure 5: Final directions of the cluster-supporting lines for two definitions of $\theta_\Delta()$: a) the default, where angle differences were obtained using the wrapped normal distribution; and, b) hand-picked angle differences, more specifically $\mathbf{\Theta_\Delta}=(0, \frac{\pi}{2}, 0, \frac{\pi}{2})$. Cluster centers, as well as the angle dispersion $\theta_\sigma$, are the same as for the example shown in Fig. \ref{fig:flow}.
  • ...and 7 more figures
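
The captions for Figures 2 and 4 name two default sampling steps: cluster sizes drawn from a discretized normal distribution with a total points correction, and line lengths drawn from a folded normal distribution. The NumPy sketch below illustrates both ideas under stated assumptions and is not Clugen's actual implementation. The function names `sketch_cluster_sizes` and `sketch_line_lengths` are hypothetical, the standard deviation of $p/(3c)$ for cluster sizes is an assumption, and the one-point-at-a-time correction is only one possible way to force the sizes to sum to $p$.

```python
import numpy as np


def sketch_cluster_sizes(c, p, rng):
    """Discretized normal cluster sizes, corrected so that they sum to p."""
    # Assumption: mean p/c and standard deviation p/(3c); not from the source.
    sizes = np.rint(np.abs(rng.normal(p / c, p / (3 * c), c))).astype(int)
    # Total points correction (one possible strategy, assumed here): move one
    # point at a time from the largest cluster, or to the smallest cluster,
    # until the total equals p.
    while sizes.sum() > p:
        sizes[np.argmax(sizes)] -= 1
    while sizes.sum() < p:
        sizes[np.argmin(sizes)] += 1
    return sizes


def sketch_line_lengths(c, l, l_sigma, rng):
    """Folded normal line lengths: the absolute value of a normal sample."""
    return np.abs(rng.normal(l, l_sigma, c))


rng = np.random.default_rng(123)
sizes = sketch_cluster_sizes(4, 5000, rng)        # c=4, p=5000 as in Figure 2
lengths = sketch_line_lengths(4, 10.0, 1.5, rng)  # l=10, l_sigma=1.5 as in Figure 1
print(sizes, sizes.sum())
print(lengths)
```

Swapping either function for a different sampler (for example, a Poisson draw) mirrors the kind of substitution the captions describe for the user-replaceable functions.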