A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections

Prodromos Kolyvakis, Aristidis Likas

TL;DR

This work tackles the challenge of testing unimodality in multidimensional data by introducing mud-pod, a multivariate unimodality test for distributions in $\mathbb{R}^d$ belonging to the $\alpha$-unimodality family. The method computes Mahalanobis distances from random observer points, applies random projections that approximately preserve distances (by the Johnson–Lindenstrauss lemma), and runs the univariate dip test on the resulting views, combining the outcomes via Monte Carlo simulation to decide $H_0: \mathbf{X} \sim \mathcal{P}_{\alpha}$. The paper provides a mathematical foundation (a decomposition theorem and translation, norm, and projection properties), demonstrates consistency of the test, and validates it empirically on synthetic and real-world datasets. mp-means further provides automatic cluster-count estimation, with competitive performance against standard clustering methods. The results highlight the benefits of the random-projection (RP) space, percentile-based observer selection, and the Mahalanobis distance for unimodality detection and clustering robustness.

Abstract

Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $α$-unimodality assumptions, presents a novel multivariate unimodality test named mud-pod. Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters.
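
As a rough illustration of the pipeline summarized above, the following sketch builds the one-dimensional "views" that a univariate test such as the dip test would then be applied to. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mahalanobis_views`, its parameters, the Gaussian projection matrix, and the choice of a random data point as observer are all hypothetical, and the dip test and the Monte Carlo combination step are deliberately left out.

```python
import numpy as np


def mahalanobis_views(X, n_views=8, proj_dim=None, seed=0):
    """Return n_views one-dimensional samples of Mahalanobis distances.

    Each view is built by (1) projecting X with a Gaussian random matrix,
    (2) picking an observer point, and (3) measuring the Mahalanobis
    distance of every projected point from that observer.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = d if proj_dim is None else proj_dim
    views = []
    for _ in range(n_views):
        # Gaussian random projection scaled by 1/sqrt(k), a common
        # Johnson-Lindenstrauss-style construction (assumption).
        R = rng.standard_normal((d, k)) / np.sqrt(k)
        Z = X @ R
        # Observer: here simply a randomly chosen projected data point
        # (assumption; the paper mentions percentile-based selection).
        observer = Z[rng.integers(n)]
        # Sample covariance of the projected data and its pseudo-inverse.
        cov = np.atleast_2d(np.cov(Z, rowvar=False))
        precision = np.linalg.pinv(cov)
        diff = Z - observer
        # Squared Mahalanobis distance: diff_i^T * precision * diff_i.
        sq_dist = np.einsum("ij,jk,ik->i", diff, precision, diff)
        views.append(np.sqrt(np.maximum(sq_dist, 0.0)))
    return views


# Usage: 200 points in 5 dimensions, three views in a 2-D projected space.
X = np.random.default_rng(1).standard_normal((200, 5))
views = mahalanobis_views(X, n_views=3, proj_dim=2)
```

Each array in `views` is a univariate sample, so any one-dimensional unimodality test can be run on it; how the per-view decisions are combined is left to the method described in the paper.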

Paper Structure

This paper contains 16 sections, 5 theorems, 7 equations, 1 figure, 4 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $\mathbf{X} \sim \mathcal{P}_{\alpha}$ and $\mathbf{c} \in \mathbb{R}^d$. Then $\mathbf{X} + \mathbf{c} \sim \mathcal{P}_{\alpha}$.

Figures (1)

  • Figure 1: Relative error between the number of clusters estimated by mp-means and the actual number, plotted against an increasing number of Monte Carlo simulations. Each experiment was run ten times, and the variance is shown. The plot is best viewed in color.

Theorems & Definitions (9)

  • Lemma 3.1: Translation Property
  • proof
  • Lemma 3.2: Norm Property
  • proof
  • Lemma 3.3: Projection Property
  • proof
  • Lemma 3.4: Mahalanobis
  • proof
  • Proposition 3.5: Randomisation Hypothesis