A scalable clustering algorithm to approximate graph cuts

Leo Suchan; Housen Li; Axel Munk

TL;DR

This work uses the original graph cuts (Ratio, Normalized, or Cheeger Cut) to detect clusters in weighted undirected graphs by restricting the graph cut minimization to $st$-MinCut partitions, leading to a runtime that is linear in the number of vertices and quadratic in the number of edges.

Abstract

Due to their computational complexity, graph cuts for cluster detection and identification are used mostly in the form of convex relaxations. We propose to utilize the original graph cuts such as Ratio, Normalized or Cheeger Cut to detect clusters in weighted undirected graphs by restricting the graph cut minimization to $st$-MinCut partitions. Incorporating a vertex selection technique and restricting optimization to tightly connected clusters, we combine the efficient computability of $st$-MinCuts and the intrinsic properties of Gomory-Hu trees with the cut quality of the original graph cuts, leading to linear runtime in the number of vertices and quadratic in the number of edges. Already in simple scenarios, the resulting algorithm Xist is able to approximate graph cut values better empirically than spectral clustering or comparable algorithms, even for large network datasets. We showcase its applicability by segmenting images from cell biology and provide empirical studies of runtime and classification rate.
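The core idea of the abstract, evaluating a graph cut functional only on partitions produced by $st$-MinCuts, can be illustrated with a minimal Python sketch. It is a brute-force illustration and not the Xist algorithm itself: it tries every pair $(s,t)$, whereas Xist restricts $s$ and $t$ to a selected vertex subset and exploits Gomory-Hu tree properties. The function names (`st_min_cut`, `cut_value`, `best_st_mincut_partition`), the Edmonds-Karp max-flow routine, and the balancing terms (one common convention; the paper's normalization may differ) are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def st_min_cut(weights, n, s, t):
    """Return (cut value, source side S) of a minimum s-t cut (Edmonds-Karp)."""
    # Residual capacities: an undirected edge has capacity in both directions.
    cap = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        cap[u][v] += w
        cap[v][u] += w
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in range(n):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path: the vertices still reachable from s form S.
            return flow, frozenset(parent)
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


def cut_value(weights, n, S, kind="ncut"):
    """Balanced cut value of the partition (S, complement of S).

    Assumed conventions: cut weight times (1/|S| + 1/|S^c|) for Ratio Cut,
    the same with volumes for Normalized Cut, and cut / min(|S|, |S^c|)
    for Cheeger Cut.
    """
    S = set(S)
    cut = sum(w for (u, v), w in weights.items() if (u in S) != (v in S))
    if kind == "ratio":
        return cut * (1.0 / len(S) + 1.0 / (n - len(S)))
    if kind == "cheeger":
        return cut / min(len(S), n - len(S))
    if kind == "ncut":
        degree = [0.0] * n
        for (u, v), w in weights.items():
            degree[u] += w
            degree[v] += w
        vol_S = sum(degree[u] for u in S)
        return cut * (1.0 / vol_S + 1.0 / (sum(degree) - vol_S))
    raise ValueError(f"unknown cut functional: {kind}")


def best_st_mincut_partition(weights, n, kind="ncut"):
    """Minimize the chosen cut functional over all st-MinCut partitions."""
    best_value, best_side = float("inf"), None
    for s, t in combinations(range(n), 2):
        _, side = st_min_cut(weights, n, s, t)
        value = cut_value(weights, n, side, kind)
        if value < best_value:
            best_value, best_side = value, side
    return best_value, best_side


# Toy graph: two triangles with heavy edges, joined by one light edge (2, 3).
toy_weights = {(0, 1): 3, (0, 2): 3, (1, 2): 3,
               (3, 4): 3, (3, 5): 3, (4, 5): 3,
               (2, 3): 1}
value, side = best_st_mincut_partition(toy_weights, n=6, kind="ncut")
print(sorted(side), value)
```

On the toy graph, the only $st$-MinCut that separates the two triangles cuts the single light edge, and it also attains the smallest Normalized Cut value among the candidate partitions. The sketch computes one max-flow per vertex pair, so it is meant for small examples only.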

Paper Structure (17 sections, 7 theorems, 16 equations, 5 figures, 2 tables, 3 algorithms)

Introduction
Definitions and notation
The algorithms
A basic algorithm for imitating graph cuts through st-min cuts
The proposed Xist algorithm
Theoretical properties
Software implementation
Simulations and applications
Approximation of multiway cuts
Empirical runtime comparison
Qualitative assessment of Xist
Clustering large network datasets
Conclusion and discussion
Appendix
Further comparison study
...and 2 more sections

Key Result

Theorem 3.1

The Xist algorithm outputs $\min_{s,t\in V^{\mathrm{loc}}} \mathrm{XC}_{S_{st}}(G)$. Further, if the uniqueness assumption holds with $s,t\in V^{\mathrm{loc}}$, the Xist algorithm and the basic Xvst algorithm yield the same output. If additionally the opti

Figures (5)

Figure 1: Illustration of the Xist algorithm for the Ratio Cut functional on a weighted toy graph, where edge thickness is proportional to edge weight. The vector $\tau$ in the Xist algorithm determines the vertices $s$ and $t$ for the next $st$-MinCut. (a) depicts the set of local maxima $V^{\mathrm{loc}}=\{1,2,3,4\}$. The Xist algorithm first computes the $21$-MinCut and updates $\tau$ in (b), then the $31$-MinCut with another update to $\tau$ in (c). Finally, the $43$-MinCut is computed, and $\tau$ is not updated further. Since the $31$-MinCut in (c) gives the best Ratio Cut value among all three partitions, the Xist algorithm outputs this partition and value.
Figure 2: Image segmentation of microtubules in NIH 3T3 cells. The colors visualize the resulting $k=10$ partitions via the iterated Xist algorithm and spectral clustering, respectively. The underlying cell clusters are visible in black, where a darker color marks a denser microtubule network.
Figure 3: Comparison of the empirical runtimes (in seconds) of the Xist algorithm (red), Leiden in its non-oracle form (orange), KaHIP (yellow), METIS (violet), spectral clustering (cyan), and the basic Xvst algorithm (green) on a $\log$-$\log$ scale against the number $n=r^2$ of vertices, with dashed lines indicating the empirical complexity of the respective algorithms. The boxplots were obtained by applying each algorithm to a dataset of 21 NIH 3T3 cell cluster images of size $504\times 504$ pixels each, after discretizing each image onto an $r\times r$ grid, for $r\in\{8,9,12,14,18,21,24,28\}$.
Figure 4: Comparison of the classification rate and NCut value of the Xist algorithm (red), the oracle version of the Leiden algorithm (orange), KaHIP (yellow), METIS (violet), Chaco (blue), spectral clustering (cyan), and the basic Xvst algorithm (green) over the intercluster distance $\delta$. The graph being partitioned is built from $n=100$ samples from a Gaussian mixture distribution (equation eq:gaussian_mixture in the paper) using a combination of 5-nearest and 0.2-neighbourhood, i.e. by defining $\{u,v\}\in E$ if and only if $u\in\mathrm{NN}_5(v)$ or $v\in\mathrm{NN}_5(u)$ or $\lVert u-v\rVert\leq 0.2$, where $\mathrm{NN}_5(u)$ denotes the set of the five nearest neighbours of $u\in V$ in $V$ with respect to the Euclidean norm $\lVert\cdot\rVert$ on $\mathbb{R}^2$. The weights are defined as $w_{uv}\coloneqq \exp(-\lVert u-v\rVert/0.2)$. These choices are made to ensure a connected graph (5-nearest neighbours) where clusters are more easily detectable (0.2-neighbourhood) and the spatial structure is pronounced (exponential weights). The curves depicted were obtained by taking the mean classification rate and mean NCut value over $100$ iterations of the above procedure, re-generating ${\boldsymbol X}$ each time.
Figure 5: Extension of Figure 2 on image segmentation of microtubules in NIH 3T3 cells. The colors visualize the resulting $k\in\{2,\ldots,9\}$ partitions via the iterated Xist algorithm and spectral clustering, and the underlying cell clusters are visible in black. The case $k=10$ is depicted in Figure 2.
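
The graph construction described in the caption of Figure 4 (5-nearest neighbours combined with a 0.2-neighbourhood, and weights $w_{uv}=\exp(-\lVert u-v\rVert/0.2)$) can be sketched in Python as follows. The function names `build_graph` and `sample_mixture` are hypothetical, and the two-component mixture used to generate points is only a stand-in: the paper's actual Gaussian mixture distribution is not reproduced here, so its component means, variances, and mixing weights are assumptions.

```python
import math
import random


def build_graph(points, k=5, radius=0.2, scale=0.2):
    """Edge weights of the combined k-nearest-neighbour / radius graph.

    {u, v} is an edge iff u is among the k nearest neighbours of v, or v is
    among those of u, or their Euclidean distance is at most `radius`.
    The weight of an edge is exp(-distance / scale).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point (excluding itself).
    nearest = []
    for u in range(n):
        others = sorted((v for v in range(n) if v != u), key=lambda v: dist[u][v])
        nearest.append(set(others[:k]))
    weights = {}
    for u in range(n):
        for v in range(u + 1, n):
            if v in nearest[u] or u in nearest[v] or dist[u][v] <= radius:
                weights[(u, v)] = math.exp(-dist[u][v] / scale)
    return weights


def sample_mixture(n, delta, sigma=0.1, seed=0):
    """Hypothetical stand-in: two equally likely Gaussian clusters in the
    plane whose centres are `delta` apart along the first coordinate."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        centre_x = 0.0 if rng.random() < 0.5 else delta
        points.append((rng.gauss(centre_x, sigma), rng.gauss(0.0, sigma)))
    return points


# Example: n = 100 samples with intercluster distance delta = 1.
sample = sample_mixture(100, delta=1.0)
graph = build_graph(sample)
print(len(graph), "edges")
```

Because distances are non-negative, every weight lies in $(0,1]$, and each vertex has at least five incident edges as long as there are more than five points.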

Theorems & Definitions (14)

Definition 1: Graph cut
Theorem 3.1
proof
Theorem 3.2
Lemma A.1
proof
Corollary A.2
Lemma A.3
proof
Lemma A.4
...and 4 more
