On the Sparsifiability of Correlation Clustering: Approximation Guarantees under Edge Sampling

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR

It is shown via Yao's minimax principle that without pseudometric structure, any algorithm observing $o(n)$ uniformly random edges incurs an unbounded approximation ratio, demonstrating that the pseudometric condition governs not only tractability but also the robustness of CC to incomplete information.

Abstract

Correlation Clustering (CC) is a fundamental unsupervised learning primitive whose strongest LP-based approximation guarantees require $\Theta(n^3)$ triangle inequality constraints and are prohibitive at scale. We initiate the study of \emph{sparsification--approximation trade-offs} for CC, asking how much edge information is needed to retain LP-based guarantees. We establish a structural dichotomy between pseudometric and general weighted instances. On the positive side, we prove that the VC dimension of the clustering disagreement class is exactly $n{-}1$, yielding additive $\varepsilon$-coresets of optimal size $\tilde{O}(n/\varepsilon^2)$; that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex, enabling an exact cutting-plane solver; and that a sparsified variant of LP-PIVOT, which imputes missing LP marginals via triangle inequalities, achieves a robust $\frac{10}{3}$-approximation (up to an additive term controlled by an empirically computable imputation-quality statistic $\overline{\Gamma}_w$) once $\tilde{\Theta}(n^{3/2})$ edges are observed, a threshold we prove is sharp. On the negative side, we show via Yao's minimax principle that without pseudometric structure, any algorithm observing $o(n)$ uniformly random edges incurs an unbounded approximation ratio, demonstrating that the pseudometric condition governs not only tractability but also the robustness of CC to incomplete information.
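To make the edge-sampling idea in the abstract concrete, the following is a minimal sketch, not the paper's coreset construction. It assumes an unweighted signed complete graph, a hypothetical planted three-cluster instance with 10% sign noise, and plain uniform sampling of vertex pairs; the function names are illustrative. It estimates the disagreement rate of one fixed clustering from a sample. The coreset guarantee quoted above is stronger, since it must hold for all clusterings at once, which is where the VC-dimension bound enters.

```python
import random
from itertools import combinations

def disagreement(sign, labels, pairs):
    """Count pairs whose sign disagrees with the clustering `labels`."""
    bad = 0
    for u, v in pairs:
        same_cluster = labels[u] == labels[v]
        # A '+' pair that is split, or a '-' pair kept together, is a disagreement.
        if (sign[(u, v)] > 0) != same_cluster:
            bad += 1
    return bad

def sampled_disagreement_rate(sign, labels, all_pairs, m, rng):
    """Estimate the disagreement rate from m uniformly sampled pairs."""
    sample = rng.sample(all_pairs, m)
    return disagreement(sign, labels, sample) / m

# Hypothetical instance: 60 vertices, 3 planted clusters, 10% of signs flipped.
rng = random.Random(0)
n = 60
all_pairs = list(combinations(range(n), 2))
planted = [v % 3 for v in range(n)]
sign = {}
for u, v in all_pairs:
    s = 1 if planted[u] == planted[v] else -1
    if rng.random() < 0.1:
        s = -s
    sign[(u, v)] = s

exact_rate = disagreement(sign, planted, all_pairs) / len(all_pairs)
est_rate = sampled_disagreement_rate(sign, planted, all_pairs, 600, rng)
print(f"exact rate {exact_rate:.3f}, sampled estimate {est_rate:.3f}")
```

For a single clustering the estimate concentrates at the usual $O(1/\sqrt{m})$ rate; the sample size needed for a guarantee uniform over clusterings is what the abstract's $\tilde{O}(n/\varepsilon^2)$ bound addresses.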

Paper Structure

This paper contains 43 sections, 17 theorems, 37 equations, 6 figures, 5 tables, and 2 algorithms.

Key Result

Theorem 3.2

For any signed complete graph $G$ on $n \geq 3$ vertices, $\mathop{\mathrm{VC}}\nolimits(\mathcal{H}_G) = n - 1$.

Figures (6)

  • Figure 1: Witness density phase transition. The fraction of vertex pairs with $\geq 1$ witness exhibits a sharp threshold at $m/n^{3/2} \approx 1$ (red dashed line), independent of $n$.
  • Figure 2: Approximation ratio vs. sample budget on SBM ($n=50$, $k=5$). Sparse-LP-Pivot converges to Full LP-PIVOT at the witness threshold $m/n^{3/2} \approx 1$.
  • Figure 3: Additive coreset error vs. sample size. The dashed line shows the theoretical $\Theta(\sqrt{n/m})$ decay rate.
  • Figure 4: Left: active vs. total triangle constraints at the LP vertex (log scale). Right: active constraint ratio decreases with $n$.
  • Figure 5: $\overline{\Gamma}_w$ as a performance diagnostic. Each point is one (dataset, sample size) configuration. The dashed line shows the theoretical bound from Theorem \ref{thm:robust-sparse-pivot}.
  • ...and 1 more figure
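The witness threshold at $m/n^{3/2} \approx 1$ described in the Figure 1 and Figure 2 captions can be explored with a small simulation. This is a sketch under stated assumptions rather than the paper's experiment: a witness for a pair $(u,v)$ is taken to be a third vertex $w$ such that both $(u,w)$ and $(v,w)$ are among the $m$ uniformly sampled edges, and `witness_fraction` is an illustrative name.

```python
import random
from itertools import combinations

def witness_fraction(n, m, rng):
    """Fraction of vertex pairs with at least one witness among m sampled edges.

    Assumed definition: w is a witness for (u, v) when both (u, w) and (v, w)
    were sampled.
    """
    all_pairs = list(combinations(range(n), 2))
    m = min(m, len(all_pairs))
    neighbours = [set() for _ in range(n)]
    for u, v in rng.sample(all_pairs, m):
        neighbours[u].add(v)
        neighbours[v].add(u)
    # A common sampled neighbour of u and v is exactly a witness.
    covered = sum(1 for u, v in all_pairs if neighbours[u] & neighbours[v])
    return covered / len(all_pairs)

if __name__ == "__main__":
    rng = random.Random(0)
    n = 100
    for ratio in (0.25, 0.5, 1.0, 2.0):
        m = int(ratio * n ** 1.5)
        print(f"m / n^1.5 = {ratio}: fraction = {witness_fraction(n, m, rng):.3f}")
```

Under this definition each pair has about $(n-2)p^2$ witnesses in expectation, with $p = m/\binom{n}{2}$, so the expected count is of constant order exactly when $m$ is of order $n^{3/2}$, consistent with the scaling in the captions.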

Theorems & Definitions (58)

  • Definition 2.1: Pseudometric weights
  • Definition 3.1: Additive edge coreset
  • Theorem 3.2: VC dimension of clustering disagreements
  • Proof sketch of Theorem 3.2
  • Theorem 3.3: Additive edge coreset for CC
  • Theorem 4.1: Active triangle inequalities at a vertex
  • Lemma 4.2: Cutting-plane correctness
  • Definition 5.1: Triangle imputation
  • Lemma 5.2: Witness density threshold
  • Definition 5.3: Good-witness fraction
  • ...and 48 more
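
Definition 5.1 (triangle imputation) is listed above by name only, so its exact rule is not reproduced here. As a hedged illustration of the underlying idea, the sketch below computes the interval that the triangle inequalities force on a missing LP value $x_{uv}$ from observed values $x_{uw}$ and $x_{vw}$: $\max_w |x_{uw} - x_{vw}| \le x_{uv} \le \min_w (x_{uw} + x_{vw})$, clipped to $[0,1]$. The name `triangle_bounds` and the dictionary layout are assumptions for this example, and how the paper chooses a single imputed value inside this interval is not stated in this summary.

```python
def triangle_bounds(x_obs, u, v, n):
    """Interval for an unobserved x_uv implied by triangle inequalities.

    x_obs maps frozenset({a, b}) to an observed LP value in [0, 1].
    Returns (lower, upper, number_of_witnesses).
    """
    lower, upper = 0.0, 1.0
    witnesses = 0
    for w in range(n):
        if w == u or w == v:
            continue
        a = x_obs.get(frozenset((u, w)))
        b = x_obs.get(frozenset((v, w)))
        if a is None or b is None:
            continue  # w is not a witness: one of its two edges is unobserved
        witnesses += 1
        lower = max(lower, abs(a - b))  # x_uv >= |x_uw - x_vw|
        upper = min(upper, a + b)       # x_uv <= x_uw + x_vw
    return lower, upper, witnesses

# Hypothetical observed values on 4 vertices; the pair (0, 1) is missing.
x_obs = {
    frozenset((0, 2)): 0.25, frozenset((1, 2)): 0.5,
    frozenset((0, 3)): 0.75, frozenset((1, 3)): 0.25,
}
print(triangle_bounds(x_obs, 0, 1, 4))
```

With no witnesses the interval stays at the trivial $[0,1]$, which is one way to see why the number of witnesses per pair matters for imputation quality.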