Correlation Clustering and (De)Sparsification: Graph Sketches Can Match Classical Algorithms

Sepehr Assadi, Sanjeev Khanna, Aaron Putterman

TL;DR

This work develops a graph-sketching framework for correlation clustering (CC) that matches, up to an $o(1)$ term, the approximation guarantees of the best polynomial-time classical algorithms while working in sublinear settings. The core idea is a de-sparsification paradigm that first recovers a fractional or spectral sparsifier and then rounds it to a simple graph, preserving enough cut structure to retain the CC value within a small factor. By coupling this with convex optimization and effective-resistance-based sampling, the authors obtain a linear sketch of size $\widetilde{O}(n)$ that enables $(\alpha_{\textnormal{best}}+o(1))$-approximate CC on any $n$-vertex graph in polynomial time, and they translate these results into sublinear algorithms for the distributed, MPC, and dynamic streaming models. The approach, including de-sparsification of spectral sparsifiers and robust deterministic streaming, provides a versatile route for porting classical correlation-clustering guarantees to sublinear computation settings with provable approximation bounds.

Abstract

Correlation clustering is a widely used approach for clustering large data sets based only on pairwise similarity information. In recent years, there has been a steady stream of better and better classical algorithms for approximating this problem. Meanwhile, another line of research has focused on porting the classical advances to various sublinear algorithm models, including semi-streaming, Massively Parallel Computation (MPC), and distributed computing. Yet, these latter works typically rely on ad-hoc approaches that do not necessarily keep up with advances in approximation ratios achieved by classical algorithms. Hence, the motivating question for our work is this: can the gains made by classical algorithms for correlation clustering be ported over to sublinear algorithms in a black-box manner? We answer this question in the affirmative by introducing the paradigm of graph de-sparsification.
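To make the objective concrete, the following is a minimal sketch of the quantity a correlation-clustering algorithm tries to minimize. It assumes the standard disagreement formulation on an unweighted graph, where every edge is a "similar" pair and every non-edge is a "dissimilar" pair; the function name `disagreements` and the toy graph are illustrative and not taken from the paper.

```python
from itertools import combinations

def disagreements(n, positive_edges, cluster_of):
    """Count disagreements of a clustering on an n-vertex graph.

    Assumption: edges are '+' (similar) pairs and all non-edges are
    '-' (dissimilar) pairs. cluster_of[v] is the cluster label of vertex v.
    """
    pos = {frozenset(e) for e in positive_edges}
    cost = 0
    for u, v in combinations(range(n), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_positive = frozenset((u, v)) in pos
        # A '+' pair that is split, or a '-' pair that is merged,
        # counts as one disagreement.
        if is_positive != same_cluster:
            cost += 1
    return cost

# Toy input: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cost_two_clusters = disagreements(6, edges, [0, 0, 0, 1, 1, 1])
cost_one_cluster = disagreements(6, edges, [0, 0, 0, 0, 0, 0])
```

On this toy graph, splitting along the two triangles is far cheaper than merging everything, which is the kind of comparison an approximation ratio is measured against.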

Paper Structure

This paper contains 37 sections, 28 theorems, 59 equations.

Key Result

Corollary 1.1

There is a polynomial-time randomized algorithm for correlation clustering in the distributed communication model with $k$ machines that uses $\widetilde{O}(nk)$ communication in total, and with high probability, achieves an $(\alpha_{\textnormal{best}}+o(1))$-approximation.
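The reason a linear sketch helps in a multi-machine setting can be illustrated with a small example. This is only a sketch under stated assumptions: it assumes the edges are partitioned across the machines, and it uses a generic shared random sign matrix as a stand-in, not the sketch constructed in the paper. All names (`sign_matrix`, `apply_sketch`, `edge_indicator`) are hypothetical. The point shown is linearity: each machine sketches its own edges, and summing the sketches gives exactly the sketch of the whole graph, so total communication is $k$ times the sketch size.

```python
import random
from itertools import combinations

def sign_matrix(rows, dim, seed):
    """Shared random +/-1 matrix; a stand-in for a real graph sketch."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]

def apply_sketch(matrix, vector):
    """Linear map: one inner product per sketch row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def edge_indicator(n, edges):
    """0/1 vector indexed by unordered vertex pairs of an n-vertex graph."""
    present = {frozenset(e) for e in edges}
    return [1 if frozenset(p) in present else 0
            for p in combinations(range(n), 2)]

n = 6
# Assumption: each machine holds a disjoint part of the edge set.
machine_edges = [
    [(0, 1), (1, 2)],
    [(0, 2), (3, 4)],
    [(4, 5), (3, 5), (2, 3)],
]
S = sign_matrix(rows=4, dim=n * (n - 1) // 2, seed=0)

# Each machine sends only its local sketch to the coordinator.
local = [apply_sketch(S, edge_indicator(n, part)) for part in machine_edges]
# The coordinator adds the sketches coordinate by coordinate.
combined = [sum(column) for column in zip(*local)]

# Sketching the full graph directly gives the same vector.
all_edges = [e for part in machine_edges for e in part]
direct = apply_sketch(S, edge_indicator(n, all_edges))
```

Here `combined` and `direct` coincide because the map is linear and the parts are disjoint; recovering a sparsifier from such a sketch is the part that requires the paper's machinery and is not attempted here.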

Theorems & Definitions (64)

  • Corollary 1.1
  • Corollary 1.2
  • Corollary 1.3
  • Corollary 1.4
  • Remark 1.5
  • Definition 2.1
  • Definition 2.2: BenczurK96
  • Definition 2.3
  • Definition 2.4: SpeilmanT08
  • Definition 2.5
  • ...and 54 more