Resampled Mutual Information for Clustering and Community Detection
Cheaheon Lim
TL;DR
The paper addresses the evaluation of clustering similarity, aiming for a constant baseline without relying on uncertain chance corrections. It introduces ResMI, an information-theoretic measure based on pairwise sampling, defined as $\text{ResMI}(f,g) = \frac{ I\big( 1_{\{ f(Z_1)=f(Z_2) \}} , 1_{\{ g(Z_1)=g(Z_2) \}} \big) }{ 0.5 \left( H\big(1_{\{ f(Z_1)=f(Z_2) \}} \big) + H\big(1_{\{ g(Z_1)=g(Z_2) \}} \big) \right) }$, where $I$ denotes mutual information and $H$ denotes entropy. The measure is shown to be model-independent, to have a constant baseline, and to equal zero on trivial labelings. Synthetic experiments demonstrate that ResMI is robust to cluster-count and symmetry biases and that it outperforms traditional NMI-based and other adjusted measures. The method is also validated on real contact-tracing networks, where ResMI recovers meaningful communities and aligns with expert ground-truth structures. Overall, ResMI provides a principled, interpretable, and scalable alternative for clustering similarity and community detection tasks, with potential for integration into clustering and community detection algorithms.
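To make the definition above concrete, here is a minimal Python sketch of one way the ratio could be evaluated for two labelings of the same items. It is an illustration under stated assumptions, not the author's implementation. It assumes that $Z_1$ and $Z_2$ are drawn independently and uniformly, with replacement, from the labeled items (the summary does not specify the sampling distribution), and it returns 0 by convention when both entropies vanish. The function names `resmi` and `_entropy` are hypothetical.

```python
import numpy as np


def _entropy(p):
    """Shannon entropy in nats of a probability vector, using 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def resmi(labels_f, labels_g):
    """Sketch of ResMI for two labelings of the same items.

    ASSUMPTION: Z_1 and Z_2 are i.i.d. uniform draws (with replacement)
    from the items; the source text does not state the sampling scheme.
    """
    # Map arbitrary labels to integer codes 0..k-1.
    f = np.unique(np.asarray(labels_f).ravel(), return_inverse=True)[1].ravel()
    g = np.unique(np.asarray(labels_g).ravel(), return_inverse=True)[1].ravel()
    if f.size != g.size or f.size == 0:
        raise ValueError("labelings must be non-empty and of equal length")

    # Contingency table of joint cluster memberships, as proportions.
    table = np.zeros((f.max() + 1, g.max() + 1))
    np.add.at(table, (f, g), 1)
    p = table / f.size

    # Probabilities that a random pair is co-clustered under f, g, and both.
    p_f = float((p.sum(axis=1) ** 2).sum())
    p_g = float((p.sum(axis=0) ** 2).sum())
    p_fg = float((p ** 2).sum())

    # Joint distribution of the two binary "same cluster" indicators.
    joint = [p_fg, p_f - p_fg, p_g - p_fg, 1.0 - p_f - p_g + p_fg]

    h_f = _entropy([p_f, 1.0 - p_f])
    h_g = _entropy([p_g, 1.0 - p_g])
    # Mutual information via I(A, B) = H(A) + H(B) - H(A, B); clip rounding noise.
    mi = max(h_f + h_g - _entropy(joint), 0.0)

    denom = 0.5 * (h_f + h_g)
    # ASSUMPTION: define the value as 0 when both labelings are trivial.
    return 0.0 if denom == 0 else mi / denom
```

Under these assumptions, calling `resmi([0, 0, 1, 1], [5, 5, 3, 3])` compares two identical partitions with different label names, and `resmi([0, 0, 0, 0], [0, 0, 1, 1])` compares a trivial labeling against a nontrivial one. Because only the 2x2 joint distribution of the two indicators is needed, the cost is driven by building the contingency table rather than by enumerating pairs.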
Abstract
We introduce resampled mutual information (ResMI), a novel measure of clustering similarity that combines insights from information-theoretic and pair-counting approaches to clustering and community detection. Like chance-corrected measures, ResMI satisfies the constant baseline property, but it has the advantages of requiring no adjustment terms and being fully interpretable in the language of information theory. Experiments on synthetic datasets demonstrate that ResMI is robust to common biases exhibited by existing measures, particularly in settings with high cluster counts and asymmetric cluster distributions. Additionally, we show that ResMI identifies meaningful community structures in two real contact-tracing networks.
