Extension of the Dip-test Repertoire -- Efficient and Differentiable p-value Calculation for Clustering

Lena G. M. Bauer; Collin Leiber; Christian Böhm; Claudia Plant

TL;DR

This work addresses the sample-size dependence of the Dip-test's Dip-p-value by introducing a differentiable sigmoid-based transformation that maps Dip-values to Dip-p-values for any sample size $N$, enabling efficient, gradient-friendly use in clustering. The authors derive a generalised Richards logistic function $p(x,\theta_N)$ whose parameters depend on $N$, yielding a smooth, differentiable mapping from $Dip$ to $p$ that supports gradient-based optimization. They validate the approach against an extended bootstrapped table, showing lower mean-squared error and competitive runtimes, and integrate the transformation into a novel subspace clustering method, Dip'n'Sub, which uses SGD to find projection axes that maximize multimodality across clusters. The results demonstrate reliable Dip-p-value computation across distributions, significant speedups over bootstrapping, and effective, interpretable subspace discovery, indicating practical value for large-scale Dip-based clustering and potential extensions to deep learning.

Abstract

Over the last decade, the Dip-test of unimodality has gained increasing interest in the data mining community as it is a parameter-free statistical test that reliably rates the modality in one-dimensional samples. It returns a so-called Dip-value and a corresponding probability for the sample's unimodality (Dip-p-value). These two values share a sigmoidal relationship. However, the specific transformation depends on the sample size. Many Dip-based clustering algorithms use bootstrapped look-up tables that translate Dip-values to Dip-p-values for a limited number of sample sizes. We propose a specifically designed sigmoid function as a substitute for these state-of-the-art look-up tables. This accelerates computation and provides an approximation of the Dip- to Dip-p-value transformation for every sample size. Further, it is differentiable and can therefore easily be integrated into learning schemes using gradient descent. We showcase this by exploiting our function in a novel subspace clustering algorithm called Dip'n'Sub. We highlight the various benefits of our proposal in extensive experiments.
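
To make the idea of a differentiable Dip- to Dip-p-value mapping more concrete, the following is a minimal sketch of the textbook generalised (Richards) logistic curve together with its analytic derivative. It is not the paper's fitted function: the names `generalised_logistic`, `generalised_logistic_grad`, and `theta`, as well as all parameter values, are hypothetical placeholders, and the way the paper ties its parameters to $N$ is not reproduced here.

```python
import math


def generalised_logistic(x, a, k, b, q, nu, c=1.0):
    """Textbook Richards curve: a + (k - a) / (c + q * exp(-b * x)) ** (1 / nu).

    With b > 0 and c = 1 the curve tends to `a` for small x and to `k`
    for large x, so a = 1, k = 0 gives a decreasing sigmoid in [0, 1].
    """
    return a + (k - a) / (c + q * math.exp(-b * x)) ** (1.0 / nu)


def generalised_logistic_grad(x, a, k, b, q, nu, c=1.0):
    """Analytic derivative of the curve with respect to x."""
    e = math.exp(-b * x)
    return (k - a) * (b * q * e / nu) * (c + q * e) ** (-1.0 / nu - 1.0)


# Hypothetical parameters (NOT fitted values from the paper), chosen only
# so that the curve falls from about 1 to about 0 over small inputs.
theta = dict(a=1.0, k=0.0, b=200.0, q=math.exp(6.0), nu=1.0)

for dip in (0.0, 0.03, 0.1):
    p = generalised_logistic(dip, **theta)
    dp = generalised_logistic_grad(dip, **theta)
    print(f"dip={dip:.2f}  p={p:.6f}  dp/ddip={dp:.4f}")
```

Because the derivative is available in closed form, a curve of this kind can sit inside a loss that is minimised by gradient descent, which is the property the abstract emphasises; a look-up table with a finite set of sample sizes does not offer this directly.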

Paper Structure (15 sections, 8 equations, 6 figures, 3 tables)

Introduction
Related Work
The Dip-test
Bootstrapping
Dip-test Related Data Mining
Common Subspace Clustering
Methods
Table Extension
Function Fit
Dip'n'Sub
Experiments and Results
Reliable Computation
Computing Time
Dip'n'Sub Evaluation
Conclusion

Figures (6)

Figure 1: (a) Bootstrapped $(Dip,p)$-pairs for sample sizes $N = 500$ (purple) and $N = 20k$ (blue) and our fitted functions (green). The transformation function from Dip- to Dip-p-value strongly depends on the sample size. (b) and (c) Histograms of samples from a $\mathcal{N}(0,1)$ normal distribution with sample size $N = 500$ and $N = 20k$. When applying the Dip-test to the two samples, their Dip-values differ by a factor of $10$. The respective Dip-p-values are, however, $0.99$ in both cases.
Figure 2: Our fitted function for the parameter $b_N$.
Figure 3: (a) Hartigan and Hartigan's original bootstrapped table only provides pairs of Dip- and Dip-p-values for $13$ different sample sizes. This table has been extended to values of $N \geq 500$. (b) We enlarge the table for even more values of $N$ as well as a larger granularity regarding $(Dip,p)$-pairs (for better visualisation, we down-sampled our table to every third point). (c) We close the remaining gaps by providing our fitted function, such as for $N=72$, for which we do not have bootstrapped values.
Figure 4: (a) Scatter matrix plot of an $8$-dimensional synthetic data set (colours correspond to ground-truth labels). (b) The horizontal histogram below illustrates the first projection identified by Dip'n'Sub. The data is highly multimodal and therefore divided into $4$ clusters. Dip'n'Sub now uses these cluster assignments to identify a second projection in which all clusters are as multimodal as possible. The second projection is shown vertically in the upper histograms, with respect to each existing cluster individually. It is easy to see that the first two clusters (purple and blue) are subdivided into $3$ and $2$ clusters, respectively. Thereafter, no multimodal third projection can be found. (c) The final clustering result of Dip'n'Sub reveals a clear separation of the clusters.
Figure 5: The Dip'n'Sub algorithm
...and 1 more figure
