Solving the Correlation Cluster LP in Sublinear Time

Nairen Cao, Vincent Cohen-Addad, Shi Li, Euiwoong Lee, David Rasmussen Lolck, Alantha Newman, Mikkel Thorup, Lukas Vogl, Shuyi Yan, Hanwen Zhang

TL;DR

The paper tackles Correlation Clustering by leveraging the cluster LP, which, despite its exponential size, can be approximated efficiently. It introduces a sublinear-time method to compute a near-optimal LP solution and extends this to a rounding procedure that achieves a $(1.485+\varepsilon)$-approximation for Correlation Clustering, matching state-of-the-art polynomial-time results with significantly reduced running time. Central to the approach are (i) a multiplicative-weights-based solver for a covering reformulation of the cluster LP, (ii) a preclustering step that yields structured atoms and admissible edges to guide clustering, (iii) a partial clustering strategy that iteratively extracts small-ratio clusters to cover a constant fraction of mass, and (iv) scalable MPC and sublinear implementations of both LP solving and rounding. Together, these techniques bridge the gap between high-accuracy approximation algorithms and fast, scalable clustering, enabling practical applications in large-scale data analysis where only sublinear or near-linear runtimes are feasible.
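For orientation, the cluster LP can be written roughly as follows. This is a sketch of the formulation as commonly stated for CCL+24, not a quotation from this paper. The symbols $E^+$ (edges of the input graph), $E^-$ (non-adjacent vertex pairs), and $x_{uv}$ (the fractional extent to which $u$ and $v$ are separated) are notation assumed here; $z_S$ is the per-set variable that also appears in Theorem 1.

$$
\begin{aligned}
\min\quad & \sum_{uv \in E^+} x_{uv} + \sum_{uv \in E^-} (1 - x_{uv}) \\
\text{s.t.}\quad & \sum_{S \ni u} z_S = 1 && \forall u \in V, \\
& \sum_{S \supseteq \{u,v\}} z_S = 1 - x_{uv} && \forall u, v \in V,\ u \neq v, \\
& z_S \ge 0 && \forall S \subseteq V,\ S \neq \emptyset.
\end{aligned}
$$

An integral solution sets $z_S = 1$ exactly for the clusters $S$ of a partition of $V$, in which case the objective counts the disagreements of that clustering. The exponential size comes from having one variable $z_S$ per nonempty vertex subset.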

Abstract

Correlation Clustering is a fundamental and widely studied problem in unsupervised learning and data mining. The input is a graph and the goal is to construct a clustering minimizing the number of inter-cluster edges plus the number of missing intra-cluster edges. CCL+24 introduced the cluster LP for Correlation Clustering, which they argued captures the problem much more succinctly than previous linear programming formulations. However, the cluster LP has exponential size, with a variable for every possible set of vertices in the input graph. Nevertheless, CCL+24 showed how to find a feasible solution for the cluster LP in time $O(n^{\text{poly}(1/\varepsilon)})$ with objective value at most $(1+\varepsilon)$ times the value of an optimal solution for the respective Correlation Clustering instance. Furthermore, they showed how to round a solution to the cluster LP, yielding a $(1.485+\varepsilon)$-approximation algorithm for the Correlation Clustering problem. The main technical result of this paper is a new approach to find a feasible solution for the cluster LP with objective value at most $(1+\varepsilon)$ times the optimum in time $\widetilde O(2^{\text{poly}(1/\varepsilon)} n)$, where $n$ is the number of vertices in the graph. We also show how to implement the rounding within the same time bounds, thus achieving a fast $(1.485+\varepsilon)$-approximation algorithm for the Correlation Clustering problem. This bridges the gap between state-of-the-art methods for approximating Correlation Clustering and the recent focus on fast algorithms.
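To make the objective concrete, here is a minimal Python sketch that counts the disagreements of a given clustering: edges that cross clusters plus non-adjacent pairs placed in the same cluster. The function name `correlation_clustering_cost`, the vertex numbering `0..n-1`, and the toy graph are illustrative assumptions, not taken from the paper. The brute-force loop over all pairs takes $\Theta(n^2)$ time and only illustrates the cost definition, not the paper's fast algorithm.

```python
from itertools import combinations


def correlation_clustering_cost(n, edges, labels):
    """Count disagreements of a clustering on vertices 0..n-1.

    edges  : iterable of (u, v) pairs, the edges of the input graph
    labels : labels[v] is the cluster id assigned to vertex v
    (Hypothetical helper for illustration only.)
    """
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(range(n), 2):
        same_cluster = labels[u] == labels[v]
        has_edge = frozenset((u, v)) in edge_set
        # A pair disagrees if it is an edge cut by the clustering,
        # or a non-edge placed inside one cluster.
        if has_edge != same_cluster:
            cost += 1
    return cost


# Toy instance (assumed): a triangle {0, 1, 2} with a pendant edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
cost_triangle_plus_singleton = correlation_clustering_cost(4, edges, [0, 0, 0, 1])
cost_one_cluster = correlation_clustering_cost(4, edges, [0, 0, 0, 0])
cost_all_singletons = correlation_clustering_cost(4, edges, [0, 1, 2, 3])
```

On this toy graph, keeping the triangle together and isolating vertex 3 cuts only the edge (2, 3), while merging everything pays for the two missing pairs (0, 3) and (1, 3), and all singletons cut all four edges.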

Paper Structure

This paper contains 54 sections, 43 theorems, 183 equations, 1 figure, 12 algorithms.

Key Result

Theorem 1

Let $\varepsilon, \delta > 0$ be small enough constants and let $\mathrm{OPT}$ be the cost of the optimum solution to the given Correlation Clustering instance. Then there is a small $\Delta = {\mathrm{poly}}(\varepsilon)$ such that the following statement holds. One can output a solution $(z_S)_{S

Figures (1)

  • Figure 1: Illustration of the sets $\hat{T}_i$, $C^*_i$, and $Q_i$. The rectangle represents $D(r)$, divided into $\eta$ parts. The red region denotes $\hat{T}_i$, containing all vertices already added to $\hat{T}$. The set $C^*_i$ includes both the red and blue regions. In $D^{i+1}_r$, the algorithm attempts to include as many vertices as possible in $C^*_i$; the yellow region represents the newly added vertices in $\hat{T}$. Claim \ref{claim:invariant-4.2} states that the yellow and blue regions have significant overlap.

Theorems & Definitions (108)

  • Theorem 1: Efficient cluster LP (\ref{LP:clusterlp})
  • Theorem 2: Efficient Rounding Algorithm
  • Corollary 3
  • Definition 4: Preclustering
  • Definition 5: $\varepsilon$-similar Preclustering
  • Definition 6: $\varepsilon$-large Cluster
  • Theorem 7: Preclustering Procedures \cite{cohen2024combinatorial}
  • Lemma 8
  • proof
  • Lemma 9
  • ...and 98 more