Comparison of modularity-based approaches for nodes clustering in hypergraphs

Veronica Poda, Catherine Matias

TL;DR

A comparative analysis of the performance of modularity-based methods for clustering nodes in binary hypergraphs using different quality measures, including true clustering recovery, running time, (local) maximization of the objective, and the number of clusters detected.

Abstract

Statistical analysis and node clustering in hypergraphs constitute an emerging topic suffering from a lack of standardization. In contrast to the case of graphs, the concept of a community of nodes in hypergraphs is not unique and encompasses various distinct situations. In this work, we conduct a comparative analysis of the performance of modularity-based methods for clustering nodes in binary hypergraphs. To this end, we begin by presenting, within a unified framework, the various hypergraph modularity criteria proposed in the literature, emphasizing their differences and respective focuses. Subsequently, we provide an overview of the state-of-the-art codes available to maximize hypergraph modularities for detecting node communities in binary hypergraphs. Through exploration of various simulation settings with controlled ground truth clustering, we offer a comparison of these methods using different quality measures, including true clustering recovery, running time, (local) maximization of the objective, and the number of clusters detected. Our contribution marks the first attempt to clarify the advantages and drawbacks of these newly available methods. This effort lays the foundation for a better understanding of the primary objectives of modularity-based node clustering methods for binary hypergraphs.

Paper Structure (16 sections, 39 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: On the left, a modular graph with two clusters is depicted, represented as circle-blue and triangle-green nodes, respectively. In each cluster, the number of within-cluster interactions is much larger than the between-clusters ones. On the right, a hypergraph is shown using the same set of nodes, where each clique from the previous graph is replaced by a hyperedge. In this hypergraph, the number of within-cluster interactions in each of the two clusters is the same as the number of between-clusters interactions. Is this hypergraph modular? Should we consider weighting hyperedges with respect to their sizes to analyze how modular the hypergraph is?
  • Figure 2: Datasets HSBM, scenarios A1 to A5. Comparison for an increasing number of nodes $n\in \{50,100,150,200,500\}$: Adjusted Rand Index (top left), time in seconds (top right), relative error on modularity (bottom left) and estimated number of clusters (bottom right, true value is $3$). One outlier has been removed from the relative error plot (bottom left): a value below -500 for the IRMM method in scenario A1. One dataset from scenario A5 gave an error with the IRMM and (consequently) the LSR methods; the corresponding results were removed from the plots.
  • Figure 3: Datasets DCHSBM, scenarios A1 to A6. Comparison for an increasing number of nodes $n\in \{50,100,150,200,500,1000\}$: Adjusted Rand Index (top left), time in seconds (top right), relative error on modularity (bottom left) and estimated number of clusters (bottom right, true value is $3$). In the time plot (top right), values for the LSR method in scenario A6 range between 15,796 and 22,350 seconds and are not shown. One outlier has been removed from the relative error plot (bottom left): a value above 300 for the IRMM method in scenario A4.
  • Figure 4: Datasets h-ABCD, scenarios A1 to A6. Comparison for an increasing number of nodes $n\in \{50,100,150,200,500,1000\}$: Adjusted Rand Index (top left), time in seconds (top right), relative error on modularity (bottom left) and estimated number of clusters (bottom right, true value is $3$). Three outliers have been removed from the relative error plot (bottom left): values at 25, -50 and -55 for the IRMM method in scenarios A6, A2 and A3 respectively.
  • Figure 5: Datasets DCHSBM, scenarios B1 to B6. Comparison for an increasing number of nodes $n\in \{50,100,150,200,500,1000\}$: Adjusted Rand Index (top left), time in seconds (top right), relative error on modularity (bottom left) and estimated number of clusters (bottom right, true value is $3$). In the time plot (top right), values for the LSR method in scenario B6 range between 16,961 and 17,895 seconds and are not shown.
  • ...and 5 more figures
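
The captions above report the Adjusted Rand Index (ARI) as the measure of true clustering recovery. As a rough illustration of what that score captures, here is a minimal sketch of the standard ARI formula in plain Python. The function name and the toy label vectors are hypothetical and are not taken from the paper's code; the handling of degenerate partitions (returning 1.0) is an assumption that mirrors a common convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two clusterings of the same nodes.

    Illustrative sketch of the standard pair-counting formula; not the
    implementation used in the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same nodes")
    n = len(labels_true)
    if n < 2:
        # No pairs of nodes to compare: treat as perfect agreement (assumption).
        return 1.0

    # Contingency counts: nodes sharing a (true cluster, predicted cluster) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Numbers of node pairs placed together in both, in truth, in prediction.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    total_pairs = comb(n, 2)
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2

    if maximum == expected:
        # Degenerate case, e.g. both partitions put all nodes in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example with three ground-truth clusters, as in the simulated scenarios.
truth = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 2, 2, 0, 0]   # same partition, different label names
print(adjusted_rand_index(truth, relabeled))
```

Because the ARI compares pairs of nodes rather than label names, a method that recovers the true partition under a different numbering of the clusters still scores 1, while values near 0 indicate agreement no better than chance.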