
Pivot based correlation clustering in the presence of good clusters

David Rasmussen Lolck, Mikkel Thorup, Shuyi Yan

Abstract

The classic pivot based clustering algorithm of Ailon, Charikar and Chawla [JACM'08] is a factor 3 approximation, but all concrete examples showing that it is no better than 3 are based on some very good clusters, e.g., a complete graph minus a matching. We show that removing all good clusters before each pivot step improves the approximation ratio to $2.9991$. We also evaluate the proposed algorithm on synthetic datasets, where it performs remarkably well and improves on both the algorithm for locating good clusters and the classic pivot algorithm.
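For orientation, the classic pivot step that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard random-pivot scheme only, not the paper's modified algorithm: the good-cluster removal step is not shown, and the function names `pivot_clustering` and `disagreements` are hypothetical. The example graph is the one mentioned in the abstract, a complete graph minus a perfect matching, here on 6 vertices.

```python
import random


def pivot_clustering(n, positive_edges, seed=None):
    """Classic pivot scheme: repeatedly take a random unclustered vertex
    and cluster it with all of its still-unclustered positive neighbours."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u, v in positive_edges:
        adj[u].add(v)
        adj[v].add(u)
    # Scanning a uniformly random permutation and skipping vertices that
    # are already clustered picks a uniformly random remaining pivot.
    order = list(range(n))
    rng.shuffle(order)
    label = [-1] * n
    for p in order:
        if label[p] != -1:
            continue
        label[p] = p  # the pivot names its cluster
        for v in adj[p]:
            if label[v] == -1:
                label[v] = p
    return label


def disagreements(positive_edges, label):
    """Positive edges cut by the clustering plus non-edges inside clusters."""
    edges = {(min(u, v), max(u, v)) for u, v in positive_edges}
    cut_positive = sum(1 for u, v in edges if label[u] != label[v])
    sizes = {}
    for c in label:
        sizes[c] = sizes.get(c, 0) + 1
    pairs_inside = sum(s * (s - 1) // 2 for s in sizes.values())
    positive_inside = len(edges) - cut_positive
    return cut_positive + (pairs_inside - positive_inside)


# Complete graph on 6 vertices minus the matching {0,1}, {2,3}, {4,5}.
n = 6
matching = {(0, 1), (2, 3), (4, 5)}
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if (u, v) not in matching]

pivot_cost = disagreements(edges, pivot_clustering(n, edges, seed=0))
one_cluster_cost = disagreements(edges, [0] * n)
print(pivot_cost, one_cluster_cost)  # prints "6 3"
```

On this instance the first pivot absorbs every vertex except its matched partner, so the pivot cost is 6 whichever pivot is chosen, while placing all vertices in a single cluster costs only the 3 missing matching edges. On $n$ vertices the same calculation gives $3n/2 - 3$ against $n/2$, which is how this family approaches factor 3.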

Paper Structure (22 sections, 29 theorems, 89 equations, 2 figures, 4 algorithms)

Key Result

Theorem 1

Algorithm alg:atom-pivot is a $2.9991$-approximation and runs in time $O(m\log n)$.

Figures (2)

  • Figure 1: Decrease in the cost of the optimal solution $\mathrm{opt}$, in an amortisation $g$ of this cost that accounts for changes to the optimal clustering, and in the cost of the pivot algorithm $\mathrm{cost}$, all up to symmetries in the labelling of triangles.
  • Figure 2: Performance of the algorithms as a function of the noise parameter $\varepsilon$. We generate graphs with $n=10^3$ vertices and a planted partition into $k=10$ clusters. Edges are added between vertices in the same cluster and then independently flipped with probability $\varepsilon$. The y-axis (cost) shows the total number of disagreements in the resulting clustering. For visualization, the plotted curves are smoothed by averaging in log-space over a sliding window of $11$ points: $l_i = \exp\!\left(\frac{1}{11}\sum_{r=i-5}^{i+5} \ln c_r\right)$, where $c_r$ is the observed cost at experiment $r$. Each point corresponds to one of $200$ different values of $\varepsilon$.
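The experimental setup and the smoothing formula in the Figure 2 caption can be sketched as below. This is an illustration under stated assumptions, not the authors' code: it assumes equal-sized planted clusters, flips the edge status of every vertex pair independently, and truncates the averaging window at the two ends of the series, none of which the caption specifies. The function names `planted_partition` and `smooth_log_window` are hypothetical.

```python
import math
import random


def planted_partition(n, k, eps, seed=None):
    """Planted partition graph with noise.

    Assumptions (not stated in the caption): clusters are equal-sized,
    and the edge/non-edge status of every vertex pair is flipped
    independently with probability eps.
    """
    rng = random.Random(seed)
    cluster = [v % k for v in range(n)]
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            same_cluster = cluster[u] == cluster[v]
            flipped = rng.random() < eps
            # An edge is present if exactly one of the two conditions holds.
            if same_cluster != flipped:
                edges.add((u, v))
    return cluster, edges


def smooth_log_window(costs, half_width=5):
    """Geometric mean over a sliding window of 2 * half_width + 1 points:
    l_i = exp(mean of ln c_r for r in [i - half_width, i + half_width]).

    Assumption: the window is truncated at the ends of the series.
    All costs must be strictly positive because of the logarithm.
    """
    smoothed = []
    for i in range(len(costs)):
        lo = max(0, i - half_width)
        hi = min(len(costs), i + half_width + 1)
        window = costs[lo:hi]
        mean_log = sum(math.log(c) for c in window) / len(window)
        smoothed.append(math.exp(mean_log))
    return smoothed
```

Averaging in log-space gives a geometric rather than arithmetic mean, so a single large cost in the window pulls the curve up less than it would under plain averaging. Calling `planted_partition(1000, 10, eps)` matches the caption's $n=10^3$ and $k=10$, at the price of a quadratic loop over all vertex pairs.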

Theorems & Definitions (54)

  • Theorem 1
  • Theorem 1
  • Theorem 1
  • Theorem 1
  • Proof of \ref{thm:alg-approx-ratio}
  • Definition 2
  • Definition 3
  • Lemma 4: CMSY15
  • Lemma 5: ACN08
  • Lemma 6
  • ...and 44 more