An FPT Constant-Factor Approximation Algorithm for Correlation Clustering

Jianqi Zhou, Zhongyi Zhang, Jiong Guo

TL;DR

The paper tackles Correlation Clustering on general unweighted graphs with missing edges, parameterized by $k$, the minimum number of vertices whose deletion leaves a complete graph. It introduces Algorithm CC, the first FPT constant-factor approximation for this setting, which runs in $2^{O(k^3)}\cdot\text{poly}(n)$ time and achieves a factor of $(\tfrac{18}{\delta^2}+7.3)$ with $\delta=\tfrac{1}{65}$, that is, $18\cdot 65^2+7.3=76057.3$. The approach combines a multiway-cut strategy applied to every partition of the bad vertices with a $\delta$-good/$\delta$-clean framework that handles the case of bad vertices in a single cluster, leveraging a $(1.994+\varepsilon)$-approximation on complete subgraphs and a series of specialized subroutines. This work advances parameterized approximability for Correlation Clustering on general graphs and points to potential hybrids with LP-based or randomized methods, as well as extensions to other parameterizations. The results give a concrete starting point for clustering under incomplete similarity information.
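To make the parameter $k$ concrete: deleting a vertex set leaves a complete graph exactly when every missing pair has at least one endpoint in that set, so $k$ is the size of a minimum vertex cover of the "missing-pair" graph. The following is a minimal brute-force sketch for tiny instances only; it is not part of the paper's algorithm, and the function names `missing_pairs` and `min_deletion_set` as well as the input encoding (a collection of labeled vertex pairs) are assumptions made for illustration.

```python
from itertools import combinations


def missing_pairs(vertices, labeled_edges):
    """Return the vertex pairs that carry no label (the "missing" edges)."""
    present = {frozenset(e) for e in labeled_edges}
    return [
        (u, v)
        for u, v in combinations(vertices, 2)
        if frozenset((u, v)) not in present
    ]


def min_deletion_set(vertices, labeled_edges):
    """Smallest vertex set whose removal leaves a complete graph.

    Hypothetical helper: tries all subsets by increasing size, so it is
    exponential in the number of vertices and only meant for toy inputs.
    """
    vertices = list(vertices)
    gaps = missing_pairs(vertices, labeled_edges)
    for size in range(len(vertices) + 1):
        for cand in combinations(vertices, size):
            chosen = set(cand)
            # A valid deletion set touches every missing pair.
            if all(u in chosen or v in chosen for u, v in gaps):
                return chosen
    return set(vertices)


# Toy instance: vertices 1..4 are pairwise labeled, vertex 5 only sees vertex 1.
edges = {
    (1, 2): "+", (1, 3): "+", (1, 4): "-",
    (2, 3): "+", (2, 4): "-", (3, 4): "-",
    (1, 5): "+",
}
bad = min_deletion_set(range(1, 6), edges)
k = len(bad)
```

In this toy instance the only missing pairs involve vertex 5, so deleting it alone suffices. The vertices in such a deletion set presumably play the role of the "bad vertices" mentioned above, though the summary does not spell out that definition.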

Abstract

The Correlation Clustering problem is one of the most extensively studied clustering formulations due to its wide applications in machine learning, data mining, computational biology and other areas. We consider the Correlation Clustering problem on general graphs: given an undirected graph (not necessarily complete) in which each edge is labeled with $\langle + \rangle$ or $\langle - \rangle$, the goal is to partition the vertices into clusters so as to minimize the number of disagreements with the edge labeling, that is, the number of $\langle - \rangle$ edges within clusters plus the number of $\langle + \rangle$ edges between clusters. Here, a $\langle + \rangle$ (or $\langle - \rangle$) edge means that its end-vertices are similar (or dissimilar) and should belong to the same cluster (or different clusters), and "missing" edges denote that we do not know whether the end-vertices are similar or dissimilar. Correlation Clustering is NP-hard even if the input graph is complete, and on general graphs it is Unique-Games-hard to obtain a polynomial-time constant-factor approximation. With a complete graph as input, Correlation Clustering admits a $(1.994+\varepsilon )$-approximation. We investigate Correlation Clustering on general graphs from the perspective of parameterized approximability. We set the parameter $k$ as the minimum number of vertices whose removal results in a complete graph, and obtain the first FPT constant-factor approximation for Correlation Clustering on general graphs, which runs in $2^{O(k^3)} \cdot \textrm{poly}(n)$ time.
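The objective described in the abstract can be illustrated with a short sketch that counts disagreements for a given clustering. The function name `disagreements`, the dictionary encoding of labels, and the `cluster_of` mapping are assumptions chosen for this example; missing pairs are simply absent from the dictionary and therefore never contribute to the cost.

```python
def disagreements(labels, cluster_of):
    """Count '-' edges inside clusters plus '+' edges between clusters.

    labels:     dict mapping a vertex pair (u, v) to "+" or "-";
                missing pairs are not listed and cost nothing.
    cluster_of: dict mapping each vertex to its cluster identifier.
    """
    cost = 0
    for (u, v), sign in labels.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign == "-" and same) or (sign == "+" and not same):
            cost += 1
    return cost


# Toy instance on vertices 1..4; the pairs (1, 4) and (2, 4) are missing.
labels = {(1, 2): "+", (2, 3): "+", (1, 3): "-", (3, 4): "+"}

one_cluster = {1: 0, 2: 0, 3: 0, 4: 0}
singletons = {1: 0, 2: 1, 3: 2, 4: 3}

cost_one = disagreements(labels, one_cluster)    # only the '-' edge (1, 3) disagrees
cost_single = disagreements(labels, singletons)  # all three '+' edges are cut
```

Putting everything into one cluster pays only for the single $\langle - \rangle$ edge, while splitting into singletons pays for every $\langle + \rangle$ edge; the optimization problem is to find the partition with the fewest such disagreements.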

Paper Structure

This paper contains 7 sections, 6 theorems, 1 equation.

Key Result

Lemma

Let $C$ be a $\delta$-clean cluster with $C\supseteq B$ and $\delta \leq 1/5$. If $\vert C \vert > \frac{1}{\delta}\vert B \vert$, then the number of $\langle - \rangle$ edges whose end-vertices are two good vertices in $C\setminus B$ is at most $2\cdot m_{\mathbb{H}}(\mathrm{OPT}(\mathbb{H}))$.

Theorems & Definitions (6)

  • Lemma (5)
  • Theorem (1)