Information-Theoretic Active Correlation Clustering

Linus Aronsson, Morteza Haghir Chehreghani

TL;DR

Correlation clustering requires pairwise similarities, which can be expensive to obtain in large numbers. The authors propose information-theoretic acquisition functions (Entropy, EIG-O, and JEIG), computed via a mean-field approximation, that select which pairs to query, enabling active correlation clustering (CC) under a query budget. Empirically, the proposed methods outperform baselines across multiple datasets and noise settings, with JEIG providing the best trade-off between accuracy and efficiency. This work advances principled query selection in non-parametric clustering and broadens active learning for pairwise relational models.

Abstract

Correlation clustering is a flexible framework for partitioning data based solely on pairwise similarity or dissimilarity information, without requiring the number of clusters as input. However, in many practical scenarios, these pairwise similarities are not available a priori and must be obtained through costly measurements or human feedback. This motivates the use of active learning to query only the most informative pairwise comparisons, enabling effective clustering under budget constraints. In this work, we develop a principled active learning approach for correlation clustering by introducing several information-theoretic acquisition functions that prioritize queries based on entropy and expected information gain. These strategies aim to reduce uncertainty about the clustering structure as efficiently as possible. We evaluate our methods across a range of synthetic and real-world settings and show that they significantly outperform existing baselines in terms of clustering accuracy and query efficiency. Our results highlight the benefits of combining active learning with correlation clustering in settings where similarity information is costly or limited.
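
To make the idea of entropy-based query selection concrete, the following is a minimal sketch and not the paper's acquisition functions. It assumes a hypothetical matrix `q` of soft cluster assignments (one row per object), treats the assignments of two objects as independent, and ranks unqueried pairs by the binary entropy of the event "u and v share a cluster". The names `same_cluster_prob`, `binary_entropy`, and `select_queries` are illustrative only.

```python
import math
from itertools import combinations

def same_cluster_prob(q_u, q_v):
    """P(c_u == c_v), assuming the two assignments are independent."""
    return sum(a * b for a, b in zip(q_u, q_v))

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def select_queries(q, queried, batch_size):
    """Return the `batch_size` unqueried pairs (u < v) with the highest entropy."""
    scored = [
        (binary_entropy(same_cluster_prob(q[u], q[v])), (u, v))
        for u, v in combinations(range(len(q)), 2)
        if (u, v) not in queried
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:batch_size]]

# Hypothetical soft assignments of 4 objects to 2 clusters (assumed data).
q = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
]
picked = select_queries(q, queried={(0, 1)}, batch_size=2)
print(picked)
```

Pairs whose same-cluster probability is closest to 0.5 are ranked first, which is the sense in which such a rule targets the most uncertain comparisons.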

Paper Structure (21 sections, 4 theorems, 41 equations, 4 figures, 5 algorithms)

Key Result

Proposition 1

The cost function (Eq. eq:cost) can be simplified to $R^{\text{CC}}(\bm{c} \mid \bm{S}) = -\sum_{\substack{(u, v) \in \mathcal{E}\\ c_u = c_v}} S_{uv} + \text{constant}$, where the constant does not depend on the clustering $\bm{c}$ [ethz-a-010077098, Chehreghani22_shift].
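
The simplification can be checked numerically. The sketch below assumes the standard correlation clustering cost, which sums $|S_{uv}|$ over violated pairs (negative similarities inside a cluster, positive similarities across clusters); the original cost equation is not reproduced in this summary, so that form is an assumption. Under it, the constant works out to $\tfrac{1}{2}\sum_{(u,v) \in \mathcal{E}} (|S_{uv}| + S_{uv})$. The similarity values and function names are hypothetical.

```python
from itertools import product

def full_cost(c, S):
    """Assumed standard CC cost: total |S_uv| over violated pairs."""
    cost = 0.0
    for (u, v), s in S.items():
        if c[u] == c[v]:
            cost += (abs(s) - s) / 2.0  # nonzero only for negative s inside a cluster
        else:
            cost += (abs(s) + s) / 2.0  # nonzero only for positive s across clusters
    return cost

def simplified_cost(c, S):
    """Negative sum of S_uv over same-cluster pairs (the constant is omitted)."""
    return -sum(s for (u, v), s in S.items() if c[u] == c[v])

# Hypothetical similarities on 4 objects; keys are pairs (u, v) with u < v.
S = {(0, 1): 0.8, (0, 2): -0.5, (0, 3): -0.9,
     (1, 2): 0.3, (1, 3): -0.4, (2, 3): 0.7}

# Constant implied by the assumed cost: half the sum of |S_uv| + S_uv.
constant = sum((abs(s) + s) / 2.0 for s in S.values())

# Enumerate every labelling of the 4 objects with up to 4 labels.
labelings = list(product(range(4), repeat=4))
gaps = [full_cost(c, S) - simplified_cost(c, S) for c in labelings]

best_full = min(labelings, key=lambda c: full_cost(c, S))
best_simple = min(labelings, key=lambda c: simplified_cost(c, S))
print(constant, best_full, best_simple)
```

Because the gap between the two costs is the same for every labelling, both are minimized by the same clusterings, so the simpler form can be optimized directly.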

Figures (4)

  • Figure 1: Results for Oracle 1 with noise level $\gamma = 0.4$. The evaluation metric is the adjusted Rand index (ARI).
  • Figure 2: Results for Oracle 2. The evaluation metric is the adjusted Rand index (ARI).
  • Figure 3: (a) Varying the noise level $\gamma$ of various methods on the synthetic dataset using Oracle 1. Some baselines are excluded for clarity due to very poor performance. (b) Varying the number of candidate pairs $|\mathcal{E}^{\text{EIG}}|$ for $a^{\text{EIG-O}}$. (c) Varying the subset size $|\mathcal{D}_i|$ for $a^{\text{JEIG}}$. (d) Varying the concentration parameter $\beta$ for $a^{\text{Entropy}}$.
  • Figure 4: Average runtime per iteration, in seconds, across four of the datasets (synthetic, forest type mapping, ecoli, and user knowledge).
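
The adjusted Rand index used in Figures 1 and 2 compares a predicted clustering with a reference one and corrects the pair-counting agreement for chance. The following is a minimal standard-library sketch of the usual contingency-table formula; the function name and the handling of degenerate inputs (returning 1.0) are choices made for this illustration and are not taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement here
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(m, 2) for m in cells.values())
    sum_rows = sum(comb(m, 2) for m in rows.values())
    sum_cols = sum(comb(m, 2) for m in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions under different label names.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that disagree on every same-cluster pair.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The score is 1 for identical partitions regardless of label names, is near 0 for chance-level agreement, and can be negative when agreement is worse than chance.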

Theorems & Definitions (6)

  • Proposition 1
  • Theorem 1
  • Proposition 1
  • Proof
  • Theorem 1
  • Proof