Bounded Graph Clustering with Graph Neural Networks

Kibidi Neocosmos, Diego Baptista, Nicole Ludwig

TL;DR

The paper tackles the issue that graph neural networks for community detection often fail to produce a user-specified number of clusters. It introduces a constraint-based approach that bounds the number of output communities by modifying the loss with a row-normalized cluster-assignment constraint and a balance regularizer, enabling ranges or exact counts. Empirical results on synthetic SBMs and real networks show the constraint effectively enforces bounds, improves clustering quality when combined with regularization, and preserves runtime. The work also outlines limitations and avenues for future research, such as searching for optimal numbers of communities and evaluating performance under weaker community structure.

Abstract

In community detection, many methods require the user to specify the number of clusters in advance since an exhaustive search over all possible values is computationally infeasible. While some classical algorithms can infer this number directly from the data, this is typically not the case for graph neural networks (GNNs): even when a desired number of clusters is specified, standard GNN-based methods often fail to return the exact number due to the way they are designed. In this work, we address this limitation by introducing a flexible and principled way to control the number of communities discovered by GNNs. Rather than assuming the true number of clusters is known, we propose a framework that allows the user to specify a plausible range and enforce these bounds during training. However, if the user wants an exact number of clusters, it may also be specified and reliably returned.
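To make the failure mode described above concrete, the following minimal sketch counts how many communities a model actually returns and checks that count against a user-specified range. It assumes, as is common in GNN clustering but not confirmed by the text above, that the model outputs a soft assignment matrix with one row per node and one column per candidate cluster, and that each node is assigned to its highest-scoring column. The function names and the example matrix are hypothetical illustrations, not the authors' code.

```python
from typing import List, Sequence


def hard_assignments(soft: Sequence[Sequence[float]]) -> List[int]:
    """Assign each node to the column with the largest score (argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in soft]


def count_communities(labels: Sequence[int]) -> int:
    """Number of non-empty communities actually used by the labeling."""
    return len(set(labels))


def within_bounds(k: int, lower: int, upper: int) -> bool:
    """True when the community count lies in the requested range [lower, upper]."""
    return lower <= k <= upper


# Hypothetical soft assignments for 5 nodes and 4 output columns.
# Columns 2 and 3 never win the argmax, so only 2 communities are returned
# even though 4 were made available.
soft = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.10],
    [0.55, 0.25, 0.10, 0.10],
]
labels = hard_assignments(soft)
k = count_communities(labels)
```

In this example a request for between 3 and 4 communities would be violated, since two of the four columns stay empty after the argmax. This is the kind of outcome the proposed bounds are meant to rule out during training.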

Paper Structure

This paper contains 17 sections, 18 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: (a) The number of communities predicted by model GNN+REG+CONSTRAINT as the lower bound varies on small networks with medium density. There are 10 networks and the model was run 3 times (with different seeds), hence there are at most 30 counts per lower bound. The gray area represents the bounded region given by the lower ($l$) and upper ($c$) bounds. The dashed line represents the ground truth number of clusters. (b) A box-and-whisker plot of the adjusted Rand index (ARI) corresponding to (a).
  • Figure 2: The number of communities found by each model when the number of clusters is varied. The experiments were performed on medium networks with medium density. 10 networks were generated for each value on the x-axis and each model was run 3 separate times per network. The gray shaded area represents the bounded region for GNN+REG+CONSTRAINT and GNN+CONSTRAINT and the horizontal dashed lines represent the ground truth number of clusters. The points represent the average output for each model with minimum and maximum error bars.
  • Figure 3: Network visualizations showing community assignments for each method for one run on one medium-sized, medium-density network. Each sub-figure displays the same network with nodes colored according to their assigned communities. Single-node communities are shown as squares, while multi-node communities are shown as circles. There is one square in (b) and two squares in (c).
  • Figure 4: A box-and-whisker plot of the adjusted Rand index (ARI) score of each model on 10 medium networks with medium density and 20 ground truth communities. Each model was run three times, resulting in 30 experiments per model.
  • Figure 5: The number of communities predicted for the real datasets (Table \ref{tab:real_stats}). The vertical gray bars represent the constrained region designated by the lower ($l$) and upper ($c$) bounds. The model was run three times with different seeds, hence there are three points per dataset.
  • ...and 3 more figures
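
Several of the captions above report the adjusted Rand index (ARI). As a reference for how that score behaves, here is a minimal sketch of the standard pair-counting formula using only the Python standard library. It is an illustrative implementation, not the evaluation code used in the paper, and the function name and the convention of returning 1.0 for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Pair-counting ARI between a ground-truth and a predicted labeling."""
    if len(true) != len(pred):
        raise ValueError("label sequences must have the same length")

    pairs_total = comb(len(true), 2)
    if pairs_total == 0:
        return 1.0  # assumed convention: fewer than two nodes, nothing to compare

    # Pairs of nodes placed together in both labelings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    # Pairs placed together within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())

    expected = sum_true * sum_pred / pairs_total  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)       # largest attainable agreement
    if max_index == expected:
        return 1.0  # assumed convention: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Same partition under different label names: perfect score.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# All nodes merged into one predicted community: no better than chance.
merged = adjusted_rand_index([0, 0, 1, 1], [0, 0, 0, 0])
```

The score is invariant to how communities are named, equals 1 for identical partitions, is 0 in expectation for random labelings, and can be negative when agreement is worse than chance.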