Information-Theoretic Active Correlation Clustering
Linus Aronsson, Morteza Haghir Chehreghani
TL;DR
Correlation clustering can be expensive because it requires many pairwise similarities, each of which may be costly to obtain. The authors propose information-theoretic acquisition functions (Entropy, EIG-O, JEIG), computed via a mean-field approximation, to choose which pairwise similarities to query, enabling active correlation clustering under a query budget. Empirically, the proposed methods outperform baselines across multiple datasets and noise settings, with JEIG providing the best accuracy-efficiency trade-off. This work advances principled query selection in non-parametric clustering and broadens active learning for pairwise relational models.
Abstract
Correlation clustering is a flexible framework for partitioning data based solely on pairwise similarity or dissimilarity information, without requiring the number of clusters as input. However, in many practical scenarios, these pairwise similarities are not available a priori and must be obtained through costly measurements or human feedback. This motivates the use of active learning to query only the most informative pairwise comparisons, enabling effective clustering under budget constraints. In this work, we develop a principled active learning approach for correlation clustering by introducing several information-theoretic acquisition functions that prioritize queries based on entropy and expected information gain. These strategies aim to reduce uncertainty about the clustering structure as efficiently as possible. We evaluate our methods across a range of synthetic and real-world settings and show that they significantly outperform existing baselines in terms of clustering accuracy and query efficiency. Our results highlight the benefits of combining active learning with correlation clustering in settings where similarity information is costly or limited.
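As a rough illustration of entropy-based query selection, the sketch below ranks candidate pairs by the binary entropy of an estimated probability that the two objects belong to the same cluster, then queries the most uncertain pairs first. This is a simplified stand-in and not the authors' implementation: the function names, the dictionary of probabilities, and the assumption that such probabilities are already available (for example, from some approximate inference step) are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_queries(same_cluster_prob, batch_size):
    """Return the `batch_size` pairs whose same-cluster probability is most uncertain.

    same_cluster_prob: dict mapping a pair (i, j), i < j, to an estimated
    probability that objects i and j share a cluster (hypothetical input).
    """
    ranked = sorted(
        same_cluster_prob,
        key=lambda pair: binary_entropy(same_cluster_prob[pair]),
        reverse=True,
    )
    return ranked[:batch_size]


# Toy probabilities for four candidate pairs (made-up values for illustration).
probs = {(0, 1): 0.95, (0, 2): 0.50, (1, 2): 0.10, (0, 3): 0.60}
chosen = select_queries(probs, batch_size=2)
print(chosen)
```

Pairs with probabilities near 0.5 are selected first because their cluster relationship is the least certain. The expected-information-gain criteria mentioned in the abstract go beyond this per-pair view and are not reproduced here.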
