On the Sparsifiability of Correlation Clustering: Approximation Guarantees under Edge Sampling

Ibne Farabi Shihab; Sanjeda Akter; Anuj Sharma

TL;DR

Via Yao's minimax principle, the paper shows that without pseudometric structure, any algorithm observing $o(n)$ uniformly random edges incurs an unbounded approximation ratio. The pseudometric condition therefore governs not only the tractability of Correlation Clustering (CC) but also its robustness to incomplete information.

Abstract

Correlation Clustering (CC) is a fundamental unsupervised learning primitive whose strongest LP-based approximation guarantees require $\Theta(n^3)$ triangle inequality constraints and are prohibitive at scale. We initiate the study of \emph{sparsification--approximation trade-offs} for CC, asking how much edge information is needed to retain LP-based guarantees. We establish a structural dichotomy between pseudometric and general weighted instances. On the positive side, we prove that the VC dimension of the clustering disagreement class is exactly $n{-}1$, yielding additive $\varepsilon$-coresets of optimal size $\tilde{O}(n/\varepsilon^2)$; that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex, enabling an exact cutting-plane solver; and that a sparsified variant of LP-PIVOT, which imputes missing LP marginals via triangle inequalities, achieves a robust $\frac{10}{3}$-approximation (up to an additive term controlled by an empirically computable imputation-quality statistic $\overline{\Gamma}_w$) once $\tilde{\Theta}(n^{3/2})$ edges are observed, a threshold we prove is sharp. On the negative side, we show via Yao's minimax principle that without pseudometric structure, any algorithm observing $o(n)$ uniformly random edges incurs an unbounded approximation ratio, demonstrating that the pseudometric condition governs not only tractability but also the robustness of CC to incomplete information.
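
To make the edge-sampling setting concrete, the following is a minimal sketch (not the paper's coreset construction) of the unweighted CC disagreement cost and a rescaled estimate of it from uniformly sampled edges. The function names, the `signs` dictionary layout, and the unweighted cost are illustrative assumptions; the paper's weighted definitions may differ.

```python
import random


def disagreement_cost(signs, labels, edges=None):
    """Count edges whose sign disagrees with the clustering `labels`.

    signs[(u, v)] is +1 (similar) or -1 (dissimilar) for u < v.
    A '+' edge disagrees when its endpoints are separated;
    a '-' edge disagrees when its endpoints share a cluster.
    """
    if edges is None:
        edges = signs.keys()
    cost = 0
    for (u, v) in edges:
        together = labels[u] == labels[v]
        if (signs[(u, v)] > 0) != together:
            cost += 1
    return cost


def sampled_cost_estimate(signs, labels, m, rng):
    """Estimate the disagreement cost from m edges sampled uniformly
    without replacement, rescaled by (total edges / m) so that the
    estimate is unbiased for any fixed clustering."""
    all_edges = list(signs)
    sample = rng.sample(all_edges, m)
    scale = len(all_edges) / m
    return scale * disagreement_cost(signs, labels, sample)
```

Calling `sampled_cost_estimate(signs, labels, m, random.Random(0))` for increasing `m` shows the estimate concentrating around `disagreement_cost(signs, labels)`; when `m` equals the number of edges the two coincide.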

Paper Structure (43 sections, 17 theorems, 37 equations, 6 figures, 5 tables, 2 algorithms)

Introduction
Contributions
Related Work
Preliminaries
Edge Coresets for Correlation Clustering
Constraint Sparsification
LP-PIVOT on Sparse Instances
Witness Density Threshold
Good-Witness Condition and Approximation Guarantees
Lower Bound for General Weighted CC
Experimental Validation
Conclusion
Deferred Preliminaries
Proof Techniques Overview
LP-PIVOT Algorithm
...and 28 more sections

Key Result

Theorem 3.2

For any signed complete graph $G$ on $n \geq 3$ vertices, $\operatorname{VC}(\mathcal{H}_G) = n - 1$.
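
The statement can be sanity-checked by brute force on very small graphs. The sketch below assumes that $\mathcal{H}_G$ is the family of disagreement-edge sets, one per clustering of the vertices; this reading of the definition, and the helper names `set_partitions` and `vc_dimension`, are assumptions made for illustration rather than the paper's notation.

```python
from itertools import combinations


def set_partitions(n):
    """Yield every clustering of range(n) as a tuple of cluster labels
    (restricted-growth strings, so each partition appears exactly once)."""
    def rec(i, labels, k):
        if i == n:
            yield tuple(labels)
            return
        for c in range(k + 1):  # join an existing cluster or open a new one
            labels.append(c)
            yield from rec(i + 1, labels, max(k, c + 1))
            labels.pop()
    yield from rec(0, [], 0)


def vc_dimension(n, signs):
    """Largest number of edges shattered by the disagreement sets.

    signs[(u, v)] is +1 or -1 for u < v. Exponential time: tiny n only.
    """
    edges = list(combinations(range(n), 2))
    # One Boolean disagreement vector per clustering, indexed like `edges`.
    vectors = [
        tuple((signs[e] > 0) != (labels[e[0]] == labels[e[1]]) for e in edges)
        for labels in set_partitions(n)
    ]
    best = 0
    for d in range(1, len(edges) + 1):
        shattered = any(
            len({tuple(vec[i] for i in idx) for vec in vectors}) == 2 ** d
            for idx in combinations(range(len(edges)), d)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so stop here
        best = d
    return best
```

Under the assumed definition, `vc_dimension(n, signs)` can be compared against $n - 1$ for $n = 3, 4, 5$ with any sign assignment. One informal way to see the value, which is not necessarily the paper's argument, is that an edge set is shattered exactly when it contains no cycle, and a forest on $n$ vertices has at most $n - 1$ edges.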

Figures (6)

Figure 1: Witness density phase transition. The fraction of vertex pairs with $\geq 1$ witness exhibits a sharp threshold at $m/n^{3/2} \approx 1$ (red dashed line), independent of $n$.
Figure 2: Approximation ratio vs. sample budget on SBM ($n=50$, $k=5$). Sparse-LP-Pivot converges to Full LP-PIVOT at the witness threshold $m/n^{3/2} \approx 1$.
Figure 3: Additive coreset error vs. sample size. The dashed line shows the theoretical $\Theta(\sqrt{n/m})$ decay rate.
Figure 4: Left: active vs. total triangle constraints at the LP vertex (log scale). Right: active constraint ratio decreases with $n$.
Figure 5: $\overline{\Gamma}_w$ as a performance diagnostic. Each point is one (dataset, sample size) configuration. The dashed line shows the theoretical bound from Theorem \ref{thm:robust-sparse-pivot}.
...and 1 more figure
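
The quantity plotted in Figure 1 can be approximated with a short simulation. This is a sketch under an assumed definition: a vertex $w$ is taken to be a witness for the pair $\{u, v\}$ when both $\{u, w\}$ and $\{w, v\}$ are among the $m$ observed edges, and the fraction is taken over all vertex pairs. The function name `witness_fraction` is hypothetical, and the paper's exact definition may differ.

```python
import random
from itertools import combinations


def witness_fraction(n, m, rng):
    """Fraction of vertex pairs {u, v} with at least one witness w.

    Assumption: w is a witness for {u, v} when both {u, w} and {w, v}
    are among m edges sampled uniformly without replacement.
    """
    all_edges = list(combinations(range(n), 2))
    observed = rng.sample(all_edges, m)
    neighbors = [set() for _ in range(n)]
    for u, v in observed:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # A common observed neighbor of u and v is a witness for the pair.
    covered = sum(1 for u, v in all_edges if neighbors[u] & neighbors[v])
    return covered / len(all_edges)
```

Sweeping `m` over multiples of `n ** 1.5` for a fixed `n` traces out a curve of the kind the caption describes. As a rough heuristic, each edge is observed with probability about $2m/n^2$, so a pair has about $4m^2/n^3$ witnesses in expectation, which is of constant order when $m \approx n^{3/2}$.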

Theorems & Definitions (58)

Definition 2.1: Pseudometric weights
Definition 3.1: Additive edge coreset
Theorem 3.2: VC dimension of clustering disagreements
Proof: Proof sketch
Theorem 3.3: Additive edge coreset for CC
Theorem 4.1: Active triangle inequalities at a vertex
Lemma 4.2: Cutting-plane correctness
Definition 5.1: Triangle imputation
Lemma 5.2: Witness density threshold
Definition 5.3: Good-witness fraction
...and 48 more
