Resampled Mutual Information for Clustering and Community Detection

Cheaheon Lim

TL;DR

The paper addresses clustering similarity evaluation with a measure that has a constant baseline yet needs no chance-correction adjustment terms. It introduces ResMI, an information-theoretic measure based on pairwise sampling, defined as $\text{ResMI}(f,g) = \frac{ I\big( 1_{\{ f(Z_1)=f(Z_2) \}} , 1_{\{ g(Z_1)=g(Z_2) \}} \big) }{ 0.5 \left( H\big(1_{\{ f(Z_1)=f(Z_2) \}} \big) + H\big(1_{\{ g(Z_1)=g(Z_2) \}} \big) \right) }$, and shows that it is model-independent, has a constant baseline, and equals zero on trivial labelings. In synthetic experiments, ResMI is robust to cluster-count and symmetry biases and outperforms traditional NMI-based and other adjusted measures. The method is also applied to two real contact-tracing networks, where it recovers meaningful communities that align with the ground-truth structure. Overall, ResMI offers a principled, interpretable, and scalable alternative for clustering similarity and community detection tasks, with potential for integration into clustering algorithms.
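The definition above can be evaluated directly from two label vectors. The following is a minimal sketch, not the author's implementation. It assumes that $Z_1$ and $Z_2$ are drawn independently and uniformly (with replacement) from the $n$ objects, so that the probability of two draws sharing a cluster is the sum of squared cluster sizes divided by $n^2$. The function name `resmi`, the helper `_entropy`, and the convention of returning 0 when both labelings are trivial are illustrative assumptions.

```python
from collections import Counter
from math import log


def _entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; zero terms are skipped."""
    return -sum(p * log(p) for p in probs if p > 0)


def resmi(f_labels, g_labels):
    """Sketch of ResMI for two labelings of the same objects.

    Assumption: Z_1 and Z_2 are i.i.d. uniform draws (with replacement)
    from the objects, so pair probabilities come from squared counts.
    """
    n = len(f_labels)
    if n == 0 or n != len(g_labels):
        raise ValueError("labelings must be non-empty and of equal length")
    n2 = n * n

    # Number of ordered pairs (Z_1, Z_2) that share a cluster under f, under g,
    # and under both at once (same cell of the contingency table).
    same_f = sum(c * c for c in Counter(f_labels).values())
    same_g = sum(c * c for c in Counter(g_labels).values())
    same_fg = sum(c * c for c in Counter(zip(f_labels, g_labels)).values())

    # Joint distribution of the two indicator variables (2 x 2 table, flattened).
    joint = [
        same_fg / n2,                        # same in f, same in g
        (same_f - same_fg) / n2,             # same in f, different in g
        (same_g - same_fg) / n2,             # different in f, same in g
        (n2 - same_f - same_g + same_fg) / n2,  # different in both
    ]
    h_f = _entropy([same_f / n2, (n2 - same_f) / n2])
    h_g = _entropy([same_g / n2, (n2 - same_g) / n2])

    # I(X; Y) = H(X) + H(Y) - H(X, Y); clamp tiny negative rounding errors.
    mi = max(0.0, h_f + h_g - _entropy(joint))
    denom = 0.5 * (h_f + h_g)
    # Assumed convention: return 0 when both indicators are deterministic.
    return mi / denom if denom > 0 else 0.0
```

Under these assumptions, two labelings that differ only by a renaming of clusters score 1, and a labeling that puts every object in one cluster scores 0 against any other labeling, consistent with the zero-on-trivial-labelings property stated above.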

Abstract

We introduce resampled mutual information (ResMI), a novel measure of clustering similarity that combines insights from information theoretic and pair counting approaches to clustering and community detection. Similar to chance-corrected measures, ResMI satisfies the constant baseline property, but it has the advantages of not requiring adjustment terms and being fully interpretable in the language of information theory. Experiments on synthetic datasets demonstrate that ResMI is robust to common biases exhibited by existing measures, particularly in settings with high cluster counts and asymmetric cluster distributions. Additionally, we show that ResMI identifies meaningful community structures in two real contact tracing networks.

Paper Structure

This paper contains 8 sections, 11 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Results from synthetic data experiments, averaged over 100 runs with error bars denoting one standard deviation. In (a)-(c), the ground truth consists of 1,024 objects in 32 equally sized clusters. Similarity with the ground truth is plotted for clusterings generated by (a) random assignment to $c$ clusters, (b) random merging/splitting of ground truth clusters to reach $c$ clusters, and (c) random shuffling of $p$ proportion of labels. In (d), the ground truth has 512 objects in one cluster, 256 in another, and the remainder evenly distributed across 5 others, with new clusterings generated by randomly reassigning labels to $p$ proportion of objects outside the first cluster.
  • Figure 2: Similarity between ground truth and estimated community labels by SCORE+ for varying values of $c$, averaged over 100 runs with error bars denoting one standard deviation. Dotted gray lines mark the ground truth number of communities. (a) Implementation on the contact tracing network of contact_data_1. (b) Implementation on the contact tracing network of contact_data_2.
  • Figure 3: (a) Ground truth and (b) estimated community labels (with $c=4$) for the contact tracing network of contact_data_1. There are 92 nodes and 755 edges, and the network layout was determined by multidimensional scaling. The distribution of nodes across ground truth communities is $\{34, 26, 15, 13, 4\}$ and the distribution of nodes across estimated communities is $\{39, 25, 15, 13\}$.
  • Figure 4: (a) Ground truth community labels, and estimated community labels with (b) $c=4$, (c) $c=10$, and (d) $c=17$ for the contact tracing network of contact_data_2. There are 217 nodes and 4,274 edges, and the network layout was determined by multidimensional scaling. The distribution of nodes across ground truth communities is $\{57, 32, 31, 23, 18, 14, 13, 9, 7, 7, 4, 2\}$. The distributions of nodes across the estimated communities are (b) $\{69, 58, 49, 41\}$, (c) $\{37, 37, 29, 26, 24, 22, 21, 15, 5, 1\}$, and (d) $\{33, 25, 23, 22, 17, 17, 16, 13, 10, 10, 10, 8, 6, 4, 1, 1, 1\}$.
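The synthetic setups described in the Figure 1 caption, panels (a) and (c), can be sketched as follows. This is one possible reading of the caption, not the paper's code. In particular, "random shuffling of $p$ proportion of labels" is interpreted here as permuting the labels among a randomly chosen fraction $p$ of the objects, and the helper names `ground_truth`, `random_assignment`, and `shuffle_fraction` are hypothetical.

```python
import random


def ground_truth(n_objects=1024, n_clusters=32):
    """Balanced ground truth: objects split into (near-)equally sized clusters."""
    return [i * n_clusters // n_objects for i in range(n_objects)]


def random_assignment(n_objects, c, rng):
    """Panel (a), as read here: each object gets one of c labels uniformly at random."""
    return [rng.randrange(c) for _ in range(n_objects)]


def shuffle_fraction(labels, p, rng):
    """Panel (c), as read here: permute the labels of a random fraction p of objects.

    Assumption: the labels of the selected objects are permuted among
    themselves, so cluster sizes are preserved.
    """
    n = len(labels)
    chosen = rng.sample(range(n), round(p * n))
    values = [labels[i] for i in chosen]
    rng.shuffle(values)
    perturbed = list(labels)
    for i, v in zip(chosen, values):
        perturbed[i] = v
    return perturbed


rng = random.Random(0)  # fixed seed so the sketch is reproducible
truth = ground_truth()
random_labels = random_assignment(len(truth), c=16, rng=rng)
half_shuffled = shuffle_fraction(truth, p=0.5, rng=rng)
```

Any similarity measure can then be evaluated between `truth` and each perturbed labeling, and averaged over repeated runs as in the caption.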