An efficient volume-preserving MBO scheme for data clustering and classification

Fabius Krämer, Tim Laux

TL;DR

The paper advances clustering and classification on graphs by formulating a volume-preserving MBO scheme that enforces exact cluster volumes via a vector-valued $V$-order statistic. It provides an exact, efficient algorithm to compute the volume-constrained thresholding step, with worst-case complexity ${O}(N(\log N + P)P^2)$ and improved, data-driven running times under a gradient-flow big-data regime, achieving ${O}(\sqrt{h}\,N\log N)$ per iteration in favorable settings. A rigorous variational analysis connects the discrete scheme to volume-preserving mean curvature flow, establishing convergence of discrete order statistics to continuous counterparts on manifolds and deriving ${L^2}$-bounds for Lagrange multipliers. The approach is complemented by extensive numerics across multiple diffusion kernels and datasets, showing competitive accuracy with favorable running times, and the authors release public code to facilitate adoption. Overall, the work delivers a principled, scalable framework for volume-aware graph clustering and semi-/unsupervised classification with strong theoretical and practical implications.

Abstract

We propose and study a novel efficient algorithm for clustering and classification tasks based on the famous MBO scheme. On the one hand, inspired by Jacobs et al. [J. Comp. Phys. 2018], we introduce constraints on the size of clusters leading to a linear integer problem. We prove that the solution to this problem is induced by a novel order statistic. This viewpoint allows us to develop exact and highly efficient algorithms to solve such constrained integer problems. On the other hand, we prove an estimate of the computational complexity of our scheme, which is better than any available provable bounds for the state of the art. This rigorous analysis is based on a variational viewpoint that connects this scheme to volume-preserving mean curvature flow in the big data and small time-step limit.
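To make the volume-constrained thresholding step concrete, here is a minimal sketch of the simplest special case: two clusters with a prescribed size for the first. This is an illustration under stated assumptions, not the authors' algorithm: the function name `threshold_two_clusters`, the array `u` of diffused label values, and the toy data are all hypothetical. With two clusters, maximizing the sum of the selected entries of `u` under an exact size constraint reduces to keeping the `v1` largest differences `u[:, 0] - u[:, 1]`, so the cut-off value is an ordinary scalar order statistic. The general case with $P$ clusters, which the paper handles through its vector-valued order statistic, is not reproduced here.

```python
import numpy as np

def threshold_two_clusters(u, v1):
    """Assign exactly v1 points to cluster 0 and the rest to cluster 1.

    u  : (N, 2) array of diffused label values (hypothetical input).
    v1 : prescribed number of points in cluster 0.

    Maximizing sum_i u[i, label_i] under the size constraint is equivalent
    to choosing the v1 points with the largest u[:, 0] - u[:, 1].
    """
    u = np.asarray(u, dtype=float)
    diff = u[:, 0] - u[:, 1]
    # Stable sort on the negated difference: ties go to the lower index.
    order = np.argsort(-diff, kind="stable")
    labels = np.ones(len(diff), dtype=int)
    labels[order[:v1]] = 0
    return labels

# Toy data: plain argmax thresholding would put three points in cluster 0.
u = np.array([[0.90, 0.10],
              [0.60, 0.40],
              [0.55, 0.45],
              [0.20, 0.80]])
labels = threshold_two_clusters(u, v1=2)
unconstrained = np.argmax(u, axis=1)
```

In this toy example the third point would join cluster 0 under unconstrained thresholding, but the size constraint moves it to cluster 1 because its difference is the smallest among the three candidates.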

Paper Structure

This paper contains 25 sections, 28 theorems, 263 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Assume $m_{i_1} \geq \dots \geq m_{i_P}$ and denote by $\chi^m$ the clustering induced by $m$ according to eq:m_induced_cluster. Then $\chi^m$ is optimal for alg:inequalityMBO if there exist $b, w \in \{1, \dots, P\}$ with $m_{i_b} \leq m_{i_w}$ such that

Figures (8)

  • Figure 1: Diffusion of labels $e_1,e_2,e_3$ over time $h$.
  • Figure 2: Toy examples for the spectral clustering achieved as limit of the volume constrained MBO scheme.
  • Figure 3: Order statistic (black point) for two (a), three (b) and four (c) clusters and the induced clustering into red, green, blue and purple points.
  • Figure 4: $\{4,4,4\}$-order statistic $m$ in black. The colors represent an $m$-induced clustering $\chi^m$. The points $a$ and $b$ on the hyperplane $H_{\bullet\blacktriangle}(m)$ could be assigned to either the $\bullet$ or the $\blacktriangle$ phase.
  • Figure 5: Visualization of the first steps of Algorithm \ref{alg:median} for $V_1 = V_2 = V_3 = 5$.
  • ...and 3 more figures

Theorems & Definitions (63)

  • Theorem 1
  • Theorem: Informal version of Theorem \ref{the:improvedRunning}
  • Definition 1
  • Lemma 1
  • proof
  • proof: Proof of Theorem \ref{the:opt_criterium}
  • Corollary 1
  • proof
  • Theorem 2: Correctness of the algorithm
  • proof
  • ...and 53 more