MC-GTA: Metric-Constrained Model-Based Clustering using Goodness-of-fit Tests with Autocorrelations

Zhangyu Wang, Gengchen Mai, Krzysztof Janowicz, Ni Lao

TL;DR

MC-GTA addresses clustering under metric constraints by explicitly modeling metric autocorrelation through a generalized model-based semivariogram and a pairwise Wasserstein-2 dissimilarity between Gaussian Markov Random Field models. It reframes clustering as minimizing a total hinge loss over intra-cluster pairs that fail goodness-of-fit tests, with a range-based penalty controlled by a margin, enabling a stable, EM-free optimization. Empirical results across 1D and 2D datasets show state-of-the-art ARI/NMI gains (up to 14.3%/32.1%) and substantial speedups (over 10x) compared with TICC/STICC, highlighting better scalability and robustness. The approach yields interpretable, distribution-aware clustering that naturally accommodates temporal/spatial constraints and can be extended to broader non-Gaussian MRF settings.

Abstract

A wide range of (multivariate) temporal (1D) and spatial (2D) data analysis tasks, such as grouping vehicle sensor trajectories, can be formulated as clustering with given metric constraints. Existing metric-constrained clustering algorithms overlook the rich correlation between feature similarity and metric distance, i.e., metric autocorrelation. The model-based variations of these clustering algorithms (e.g., TICC and STICC) achieve SOTA performance, yet suffer from computational instability and complexity because they rely on a metric-constrained Expectation-Maximization procedure. To address these two problems, we propose a novel clustering algorithm, MC-GTA (Model-based Clustering via Goodness-of-fit Tests with Autocorrelations). Its objective consists solely of pairwise weighted sums of feature similarity terms (squared Wasserstein-2 distance) and metric autocorrelation terms (a novel multivariate generalization of the classic semivariogram). We show that MC-GTA effectively minimizes the total hinge loss for intra-cluster observation pairs not passing goodness-of-fit tests, i.e., pairs that statistically do not originate from the same distribution. Experiments on 1D/2D synthetic and real-world datasets demonstrate that MC-GTA successfully incorporates metric autocorrelation. It outperforms strong baselines by large margins (up to 14.3% in ARI and 32.1% in NMI) with faster and more stable optimization (>10x speedup).
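
The feature similarity term mentioned above, the squared Wasserstein-2 distance, has a well-known closed form when both distributions are Gaussian: $W_2^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big)$. The following is a minimal sketch of that formula, not the authors' implementation. The function names `psd_sqrt` and `gaussian_w2_squared` are hypothetical, and the sketch assumes each model is given by a mean vector and a covariance matrix (a GMRF specified by a precision matrix would need to be inverted first).

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix (hypothetical helper)."""
    sym = 0.5 * (mat + mat.T)                 # remove floating-point asymmetry
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)           # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(mean1, cov1, mean2, cov2):
    """Closed-form squared Wasserstein-2 distance between two Gaussians."""
    mean1, mean2 = np.asarray(mean1, float), np.asarray(mean2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    root2 = psd_sqrt(cov2)
    cross = psd_sqrt(root2 @ cov1 @ root2)
    mean_term = float(np.sum((mean1 - mean2) ** 2))
    cov_term = float(np.trace(cov1 + cov2 - 2.0 * cross))
    return mean_term + max(cov_term, 0.0)     # clamp numerical noise below zero


# Two unit-covariance Gaussians whose means differ by (3, 4).
d2 = gaussian_w2_squared([0.0, 0.0], np.eye(2), [3.0, 4.0], np.eye(2))
```

With equal covariances the covariance term vanishes, so the distance reduces to the squared Euclidean distance between the means.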

Paper Structure

This paper contains 35 sections, 17 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation of MC-GTA, using the iNaturalist-2018 dataset as an example. We wish to cluster wild animal photos based on both their image similarity and spatial adjacency. For any pair of observations, we obtain their metric distance $d_c$ and generalized model-based semivariance $W_2^2$ (squared Wasserstein-2 distance), which quantifies feature similarity via the underlying models. In the presence of metric autocorrelation, the expected generalized model-based semivariance is in theory an increasing function of $d_c$ within range $\rho$ and levels off beyond $\rho$; this function is the theoretical generalized model-based semivariogram $\gamma_m$. We fit $\gamma_m$ from the empirical generalized model-based semivariogram $\hat{\gamma}_m$. MC-GTA penalizes observation pairs whose $W_2^2$ is close to or exceeds $\gamma_m$ via a hinge loss with margin $\delta$. An observation pair that incurs no hinge loss penalty is equivalent to one that passes a goodness-of-fit test with significance level $\delta$.
  • Figure 2: Empirical generalized model-based semivariogram under different hyperparameter settings. $n$ is the number of nearest neighbors used for fitting the GMRF models for each observation. The color represents the percentage of observation pairs that belong to the same ground-truth cluster in each $0.0001\text{(geodesic)} \times 0.01\text{(Wasserstein-2)}$ bin.
  • Figure 3: Histograms of pairwise distances between intra-cluster and inter-cluster observations in the Pavement dataset. The distributions of intra/inter-cluster Wasserstein-2 distance show more distinctive patterns than those of cosine distance and Euclidean distance.
  • Figure 4: Comparison of the Climate dataset, the iNaturalist-2018 dataset, and the NYU POI/Land-use datasets. For visualization, we plot only a subset of iNaturalist-2018 (California, two species). The plots show that 1) the Climate dataset is very sparse and its ground-truth clusters have clear-cut borders, and 2) the iNaturalist/NYU datasets are dense and their ground-truth clusters overlap each other.
  • Figure 5: Performance curves with respect to the grid-searched hyperparameters $n$, $\beta$, and $\delta$.
  • ...and 1 more figures
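
The penalty described in the Figure 1 caption can be illustrated with a short sketch. It is not the paper's exact objective, which is not reproduced in this section. It assumes a classic spherical semivariogram as a stand-in for the fitted $\gamma_m$ (rising up to the range $\rho$, then flat at a sill) and a per-pair hinge of the form $\max(0, W_2^2 - \gamma_m(d_c) + \delta)$. The function names, the `sill` parameter, and the omission of any pair weighting such as $\beta$ are assumptions made for illustration.

```python
import numpy as np


def spherical_semivariogram(d, sill, rho):
    """Spherical model (assumed stand-in for gamma_m): increasing on [0, rho], flat beyond."""
    h = np.clip(np.asarray(d, dtype=float) / rho, 0.0, 1.0)
    return sill * (1.5 * h - 0.5 * h ** 3)


def pairwise_hinge(w2_sq, d, sill, rho, delta):
    """Hinge penalty for pairs whose W_2^2 is within delta of, or above, gamma_m(d)."""
    gamma = spherical_semivariogram(d, sill, rho)
    return np.maximum(0.0, np.asarray(w2_sq, dtype=float) - gamma + delta)


# Three hypothetical pairs at the same metric distance, beyond the range rho.
w2_sq = np.array([0.50, 0.95, 1.50])
d_c = np.array([10.0, 10.0, 10.0])
penalties = pairwise_hinge(w2_sq, d_c, sill=1.0, rho=2.0, delta=0.1)
```

Under these assumptions the first pair lies well below the semivariogram and receives no penalty, the second falls inside the margin and receives a small one, and the third exceeds the semivariogram and is penalized most.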