ClustML: A Measure of Cluster Pattern Complexity in Scatterplots Learnt from Human-labeled Groupings

Mostafa M. Abbas, Ehsan Ullah, Abdelkader Baggag, Halima Bensmail, Michael Sedlmair, Michaël Aupetit

TL;DR

ClustML addresses the challenge of quantifying perceptual clustering in scatterplots by learning a merging function from human judgments within a Gaussian Mixture Model framework. It replaces the heuristic merging function of ClustMe with a data-driven binary classifier trained on the S1 human judgments, achieving near-perfect alignment with the perceptual data and improved ranking accuracy on S2. The approach yields a higher-fidelity VQM for cluster patterns, demonstrated via experiments and a genomic data usage scenario, and it makes the benchmark datasets and the new VQM available for future work. This hybrid perceptual-computational model supports scalable visual analytics by enabling more reliable detection of complex cluster patterns in large sets of scatterplots, such as those used in GWAS kinship analyses.

Abstract

Visual quality measures (VQMs) are designed to support analysts by automatically detecting and quantifying patterns in visualizations. We propose a new VQM for visual grouping patterns in scatterplots, called ClustML, which is trained on previously collected human subject judgments. Our model encodes scatterplots in the parametric space of a Gaussian Mixture Model and uses a classifier trained on human judgment data to estimate the perceptual complexity of grouping patterns. The number of initial mixture components and the number of final combined groups together form the measure. It improves on existing VQMs, first, by better estimating human judgments on two-Gaussian cluster patterns and, second, by giving higher accuracy when ranking general cluster patterns in scatterplots. We use it to analyze kinship data for genome-wide association studies, in which experts rely on the visual analysis of large sets of scatterplots. We make the benchmark datasets and the new VQM available for practical use and further improvements.
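
The abstract describes the measure as two counts: the number of initial mixture components and the number of groups left once components have been merged. The sketch below illustrates only that counting step, assuming the mixture has already been fitted and that some pairwise merging decision is available. The function `count_merged_groups`, the `should_merge` callback, and the distance threshold used as a stand-in for the trained classifier are all hypothetical names and choices made for illustration, not the authors' implementation.

```python
from itertools import combinations
from typing import Callable, Tuple


def count_merged_groups(
    k: int, should_merge: Callable[[int, int], bool]
) -> Tuple[int, int]:
    """Return (M, K): the M groups left after pairwise merging of K components."""
    parent = list(range(k))  # union-find forest, one tree per component

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Ask the merging decision about every unordered pair of components.
    for u, v in combinations(range(k), 2):
        if should_merge(u, v):
            parent[find(u)] = find(v)

    # Each distinct root is one connected group of merged components.
    m = len({find(i) for i in range(k)})
    return m, k


# Hypothetical stand-in for the trained classifier: merge two components
# when their 1-D centres are closer than a fixed threshold (an assumption).
centres = [0.0, 0.5, 5.0, 5.4, 12.0]


def merge_if_close(u: int, v: int) -> bool:
    return abs(centres[u] - centres[v]) < 1.0


print(count_merged_groups(len(centres), merge_if_close))  # (3, 5)
```

Merging is transitive here: if component 0 merges with 1 and 1 merges with 2, all three end up in one group even when 0 and 2 would not merge directly. That is what counting connected components implies.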

Paper Structure

This paper contains 22 sections, 13 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: A visual quality measure (VQM) based on a Gaussian Mixture Model (GMM) for cluster patterns in scatterplots is made of three stages: (1) a data-driven process estimates the parameters of a GMM of the data point density in the scatterplot; (2) the degree of overlap of each pair of GMM components is computed to provide additional characteristics of interest to quantify cluster patterns; (3) the data points, the GMM parameters, and the pairwise quantities are aggregated to compute the visual quality measure. ClustMe and ClustML are both GMM-based VQMs, differing in the way they quantify pairwise overlap of GMM components (Stage 2).
  • Figure 2: ClustMe and ClustML are GMM-based VQMs for cluster patterns. (a) The VQM pipeline of ClustMe uses a heuristic (Demp) as a merging decision function for each pair of GMM components. (b) ClustML follows the same pipeline as ClustMe but uses an automatic classifier as a merging decision function (green), trained on $1000$ monochrome scatterplots from a previous study (ClustMe_eurovis2019). These scatterplots were generated in study S1 by varying the parameters $\phi_{uv}$ of a GMM with 2 components and were labeled by $34$ subjects $(H_1,\ldots,H_{34})$, each reporting whether they saw one ($H_n=0$) or more than one ($H_n=1$) cluster.
  • Figure 3: ClustML measures the amount of grouping in scatterplots based on a classifier trained on human judgments. A bivariate Gaussian Mixture Model (Stage 1) models the distribution of the points in the scatterplot to evaluate. Each possible pair of its $K^*$ Gaussian components is assessed for merging (Stage 2). For that purpose, and as the main novelty of this work, a binary classifier $\mathcal{G}$ has been trained in the parameter space $\Phi_{uv}$ of component pairs $(u,v)$ (red and blue dots on the right; this space actually has $8$ dimensions). Scatterplots (solid red and blue frames) generated by $1000$ pairs were labeled in a previous experiment (ClustMe_eurovis2019) by $34$ subjects asked to decide whether each scatterplot shows one (red) or more than one (blue) cluster. Five such "Training" scatterplots with plain-line blue or red frames are displayed, and four others in the right column with the percentage of subjects seeing more than one cluster. After training, the classifier $\mathcal{G}_{ClustML}$ automatically predicts the merging decision (green solid line separating blue and red areas) that humans would make for as yet unseen $2$-Gaussian scatterplots (dashed green frames). This pairwise merging decision on GMM components generates a set of $M$ connected components (purple frame). Finally, the ClustML VQM (Stage 3) of the evaluated scatterplot is given by the pair $(M,K^*)$; the higher the score, the more complex the grouping pattern.
  • Figure 4: Parameters ${\phi}_{uv}=(\tau,\mu,\sigma_u^x,\sigma_u^y,\sigma_v^x,\sigma_v^y,\theta_u,\theta_v)$ of a pair of Gaussian components $(u,v)$ of $\mathcal{M}^*$ control the direction ($\theta$), the probability ($\tau$), the extent ($\sigma$), and the distance ($\mu$) of the two component distributions, hence the (perceptual) overlap of their sampled data. These parameter vectors span the feature space $\mathcal{S}$ (Figure \ref{fig:ClustML_overview}), which is the input of the classifier $\mathcal{G}_{ClustML}$ that decides whether to merge $u$ and $v$.
  • Figure 5: Data augmentation process: (a) We expect each set of parameters of a pair of GMM components to correspond to a unique scatterplot, up to sampling variability, and vice versa. But there are symmetries for some settings of these parameters or some scatterplots. (b) The parameters of a pair of components (A, B, C, D) can differ from those of (A', B', C', D') respectively while representing the exact same cluster pattern in the scatterplot, due to symmetry or rotation of the group of points in the scatterplot. (c) Data augmentation exploits these known symmetries to generate additional data (A', B', C', D') with labels corresponding to their symmetrical versions (A, B, C, D), enriching the dataset and improving classifier generalizability. (d) GMMs can model the same scatterplot with different parameters, leading to different locations in the feature space. (e) We generate new data in the feature space leading to the same scatterplot, hence the same label. In all cases (c, e), we need to cover the feature space with labeled examples to better support the training of the classifier; otherwise, the classifier will generalize poorly in these areas (left side, b, d). The human judgment dataset S1 does not contain such symmetries because it was designed to avoid showing the same scatterplot twice to human subjects. Therefore, we need to augment these data in the feature space by duplicating labeled scatterplots according to these symmetries (right side, c, e).
  • ...and 4 more figures
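
The Figure 5 caption notes that different parameter settings can describe the same scatterplot. A minimal sketch of one such symmetry follows, assuming each component's covariance is an axis-aligned Gaussian with standard deviations $(\sigma^x, \sigma^y)$ rotated by $\theta$. That parameterization is an assumption made here, and the helper names `covariance`, `equivalent_parameters`, and `same_covariance` are hypothetical. Under this assumption, rotating by $\pi$, or swapping the two standard deviations while rotating by $\pi/2$, leaves the covariance unchanged, so a human label given to one setting can be copied to the others. This does not reproduce the paper's actual augmentation procedure.

```python
import math
from typing import List, Tuple

Cov = Tuple[float, float, float]  # (var_x, cov_xy, var_y)


def covariance(sigma_x: float, sigma_y: float, theta: float) -> Cov:
    """Covariance of an axis-aligned Gaussian rotated by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    var_x = (c * sigma_x) ** 2 + (s * sigma_y) ** 2
    var_y = (s * sigma_x) ** 2 + (c * sigma_y) ** 2
    cov_xy = c * s * (sigma_x ** 2 - sigma_y ** 2)
    return var_x, cov_xy, var_y


def equivalent_parameters(
    sigma_x: float, sigma_y: float, theta: float
) -> List[Tuple[float, float, float]]:
    """Other (sigma_x, sigma_y, theta) settings with the same covariance."""
    return [
        (sigma_x, sigma_y, theta + math.pi),      # half-turn of the ellipse
        (sigma_y, sigma_x, theta + math.pi / 2),  # swap axes, quarter-turn
        (sigma_y, sigma_x, theta - math.pi / 2),  # swap axes, opposite turn
    ]


def same_covariance(a: Cov, b: Cov, tol: float = 1e-9) -> bool:
    """Compare two covariances entry by entry within a small tolerance."""
    return all(math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b))


original = (2.0, 0.5, 0.3)
for params in equivalent_parameters(*original):
    # Each augmented setting maps to the same covariance as the original.
    print(params, same_covariance(covariance(*original), covariance(*params)))
```

Swapping the standard deviations without the quarter-turn does change the covariance, so it is not a valid augmentation under this parameterization.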