Learning-Augmented Streaming Algorithms for Correlation Clustering

Yinhao Dong, Shan Jiang, Shi Li, Pan Peng

TL;DR

This work develops the first learning-augmented streaming algorithms for Correlation Clustering on both complete and general graphs, using a predictor of pairwise distances between vertices to improve space-approximation tradeoffs. For complete graphs, a single-pass algorithm for dynamic streams achieves a $(\min\{2.06\beta,3\}+\varepsilon)$-approximation with $\tilde{O}(\varepsilon^{-2}n)$ space, which beats the previous $(3+\varepsilon)$-approximation in dynamic streams when predictions are good. For general graphs, the authors obtain an $O(\beta\log|E^-|)$-approximation with $\tilde{O}(\varepsilon^{-2}n)$ space by combining spectral sparsification, a predictor-informed ball-growing procedure, and a conditional post-processing step. The theoretical guarantees are complemented by extensive experiments on synthetic and real-world data, which show substantial improvements over non-learning baselines and robustness to poor predictions. The work advances practical, scalable clustering in streaming settings by integrating learned predictions with classic graph-structural algorithms.
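
As a quick worked reading of the complete-graph bound (simple arithmetic on the stated ratio, not a result taken from the paper), the guarantee improves on $3+\varepsilon$ exactly when the $2.06\beta$ term is the smaller one:

$$
\min\{2.06\beta,\,3\} < 3 \iff 2.06\beta < 3 \iff \beta < \frac{3}{2.06} \approx 1.456 .
$$

For instance, $\beta = 1$ gives a $(2.06+\varepsilon)$-approximation and $\beta = 1.2$ gives $(2.472+\varepsilon)$, while any $\beta \geq 3/2.06$ falls back to $(3+\varepsilon)$.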

Abstract

We study streaming algorithms for Correlation Clustering. Given a graph as an arbitrary-order stream of edges, with each edge labeled as positive or negative, the goal is to partition the vertices into disjoint clusters, such that the number of disagreements is minimized. In this paper, we give the first learning-augmented streaming algorithms for the problem on both complete and general graphs, improving the best-known space-approximation tradeoffs. Based on the works of Cambus et al. (SODA'24) and Ahn et al. (ICML'15), our algorithms use the predictions of pairwise distances between vertices provided by a predictor. For complete graphs, our algorithm achieves a better-than-$3$ approximation under good prediction quality, while using $\tilde{O}(n)$ total space. For general graphs, our algorithm achieves an $O(\log |E^-|)$ approximation under good prediction quality using $\tilde{O}(n)$ total space, improving the best-known non-learning algorithm in terms of space efficiency. Experimental results on synthetic and real-world datasets demonstrate the superiority of our proposed algorithms over their non-learning counterparts.
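
To make the objective concrete, the following minimal sketch counts the disagreements of a given clustering in one pass over a stream of signed edges. It uses the standard definition of a disagreement (a positive edge between different clusters, or a negative edge inside a cluster). The function name `count_disagreements`, the `(u, v, sign)` edge encoding, and the toy graph are illustrative assumptions, not taken from the paper. This only evaluates a clustering and is not the authors' algorithm.

```python
from typing import Dict, Hashable, Iterable, Tuple

# Hypothetical edge encoding: (u, v, sign) with sign = +1 (positive) or -1 (negative).
Edge = Tuple[Hashable, Hashable, int]


def count_disagreements(edges: Iterable[Edge], cluster_of: Dict[Hashable, int]) -> int:
    """Count disagreements of a clustering in a single pass over the edges.

    A positive edge disagrees if its endpoints lie in different clusters;
    a negative edge disagrees if its endpoints lie in the same cluster.
    """
    disagreements = 0
    for u, v, sign in edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if sign > 0 and not same_cluster:
            disagreements += 1
        elif sign < 0 and same_cluster:
            disagreements += 1
    return disagreements


# Toy instance (assumed for illustration): four vertices, four labeled edges.
edges = [(0, 1, +1), (1, 2, +1), (0, 2, -1), (2, 3, -1)]

one_big_cluster = {0: 0, 1: 0, 2: 0, 3: 1}   # {0, 1, 2}, {3}
all_singletons = {0: 0, 1: 1, 2: 2, 3: 3}    # {0}, {1}, {2}, {3}

print(count_disagreements(edges, one_big_cluster))  # 1: negative edge (0, 2) is inside a cluster
print(count_disagreements(edges, all_singletons))   # 2: both positive edges are cut
```

Here the clustering $\{0,1,2\},\{3\}$ is better than all singletons because it pays for only one edge. An approximation ratio such as those quoted above compares the count achieved by an algorithm against the minimum over all partitions.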

Paper Structure

This paper contains 46 sections, 33 theorems, 14 equations, 3 figures, 6 tables, 12 algorithms.

Key Result

Theorem 1.3

Let $\varepsilon\in (0,1/4)$ and $\beta \geq 1$. Given oracle access to a $\beta$-level predictor, there exists a single-pass streaming algorithm that, with high probability, achieves an expected $(\min\{2.06\beta, 3\}+\varepsilon)$-approximation for Correlation Clustering on complete graphs in dyna

Figures (3)

  • Figure 1: Performance of Algorithm \ref{alg:dynamic-stream} on synthetic datasets. We examine the effects of prediction quality $\beta$, SBM parameter $p$, and graph size $n$. We set $n=100$ in (\ref{fig:synthetic-p-0.9})--(\ref{fig:synthetic-p-0.7}) and $p=0.95$ in (\ref{fig:synthetic-vary-n-dynamic}).
  • Figure 2: Performance of Algorithm \ref{alg:dynamic-stream} on real-world datasets. (\ref{fig:fb0-vary-beta})--(\ref{fig:fb3980-vary-beta}) show the effect of $\beta$ on three Facebook subgraphs. (\ref{fig:email-vary-d}) shows the effect of the dimension $d$ of spectral embeddings on EmailCore. Note that a larger $d$ indicates higher prediction quality (i.e., a smaller $\beta$).
  • Figure 3: Performance of Algorithm \ref{alg:insertion-only} on real-world datasets. (\ref{fig:insertion-fb0-vary-beta})--(\ref{fig:insertion-fb414-vary-beta}) show the effect of prediction quality $\beta$ on two Facebook subgraphs, where we use noisy predictors. (\ref{fig:insertion-email-vary-d})--(\ref{fig:insertion-lastfm-vary-d}) examine the effect of the dimension $d$ of spectral embeddings on EmailCore and LastFM, where we use spectral embedding as the predictor. We set $k=25$ for (\ref{fig:insertion-fb0-vary-beta}), $k=15$ for (\ref{fig:insertion-fb414-vary-beta}), $k=10$ for (\ref{fig:insertion-email-vary-d}), and $k=50$ for (\ref{fig:insertion-lastfm-vary-d}).

Theorems & Definitions (52)

  • Example 1.1: Multiple graphs on the same vertex set
  • Example 1.2: Temporal graphs
  • Theorem 1.3
  • Theorem 1.4
  • Definition 3.1: $\beta$-level predictor
  • Lemma 4.1
  • Lemma 4.2: Lemma 8 in CKLPU24
  • Lemma 4.3
  • Lemma 4.4: CKLPU24
  • Lemma 4.5
  • ...and 42 more