VEC-SBM: Optimal Community Detection with Vectorial Edges Covariates

Guillaume Braun, Masashi Sugiyama

TL;DR

This paper introduces the Vectorial Edges Covariates SBM (VEC-SBM), extending the classic SBM to incorporate edge-level vector covariates and quantify their multiplicative effect on the information available for community detection. It proposes an efficient iterative MAP-based algorithm, IR-VEC (and a simplified variant sIR-VEC), with theoretical guarantees showing optimal convergence rates under suitable conditions and a minimax lower bound confirming optimality. The analysis reveals how edge covariates amplify the SNR by a factor dependent on graph sparsity and covariate separation, enabling accurate recovery even when graph structure alone is weak. Numerically, IR-VEC and sIR-VEC outperform baselines across synthetic, semi-synthetic, and real data, demonstrating robustness to non-isotropic covariances and varying numbers of communities, with practical gains in clustering accuracy. The work provides a principled framework for leveraging edge information in network clustering and outlines avenues for future extensions to high-dimensional covariates and model selection.

Abstract

Social networks are often associated with rich side information, such as texts and images. While numerous methods have been developed to identify communities from pairwise interactions, they usually ignore such side information. In this work, we study an extension of the Stochastic Block Model (SBM), a widely used statistical framework for community detection, that integrates vectorial edges covariates: the Vectorial Edges Covariates Stochastic Block Model (VEC-SBM). We propose a novel algorithm based on iterative refinement techniques and show that it optimally recovers the latent communities under the VEC-SBM. Furthermore, we rigorously assess the added value of leveraging edges' side information in the community detection process. We complement our theoretical results with numerical experiments on synthetic and semi-synthetic data.
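
To make the setting concrete, here is a minimal Python sketch of an SBM with vector covariates on its edges, followed by one node-wise refinement pass. It is an illustration under stated assumptions, not the authors' IR-VEC: it assumes Gaussian edge covariates with isotropic covariance $\sigma^2 I_d$, block-dependent means `mu[k, l]`, a uniform prior over communities (so the MAP update reduces to a likelihood maximisation), and model parameters that are known rather than estimated. The function names `sample_vec_sbm` and `refine_once` and all parameter values are hypothetical.

```python
import numpy as np


def sample_vec_sbm(n, K, P, mu, sigma, rng):
    """Sample labels z, adjacency A and edge covariates X (zero where no edge).

    Assumed model: A_ij ~ Bernoulli(P[z_i, z_j]) and, on each edge,
    X_ij ~ N(mu[z_i, z_j], sigma^2 * I_d).
    """
    z = rng.integers(0, K, size=n)
    d = mu.shape[-1]
    A = np.zeros((n, n), dtype=int)
    X = np.zeros((n, n, d))
    iu, ju = np.triu_indices(n, k=1)              # each unordered pair once
    edge = rng.random(iu.size) < P[z[iu], z[ju]]
    i, j = iu[edge], ju[edge]
    A[i, j] = A[j, i] = 1
    cov = mu[z[i], z[j]] + sigma * rng.standard_normal((i.size, d))
    X[i, j] = cov                                 # covariates are symmetric
    X[j, i] = cov
    return z, A, X


def refine_once(A, X, z, K, P, mu, sigma):
    """Reassign every node to the community maximising its log-likelihood,
    holding the current labels of all other nodes fixed."""
    n = A.shape[0]
    z_new = z.copy()
    for i in range(n):
        others = np.arange(n) != i
        a, x, zj = A[i, others], X[i, others], z[others]
        scores = np.empty(K)
        for k in range(K):
            p = P[k, zj]
            # Bernoulli term: presence or absence of each potential edge
            ll_graph = a * np.log(p) + (1 - a) * np.log1p(-p)
            # Gaussian term: only observed edges carry a covariate
            sq_dist = ((x - mu[k, zj]) ** 2).sum(axis=1)
            ll_cov = -a * sq_dist / (2 * sigma ** 2)
            scores[k] = (ll_graph + ll_cov).sum()
        z_new[i] = np.argmax(scores)
    return z_new


rng = np.random.default_rng(0)
P = np.array([[0.15, 0.05], [0.05, 0.15]])        # hypothetical connectivity
m = np.array([1.0, 1.0])
mu = np.array([[m, -m], [-m, m]])                 # hypothetical covariate means
z, A, X = sample_vec_sbm(300, 2, P, mu, 1.0, rng)

z0 = z.copy()
flip = rng.choice(z.size, size=z.size // 4, replace=False)
z0[flip] = 1 - z0[flip]                           # corrupt 25% of the labels
z1 = refine_once(A, X, z0, 2, P, mu, 1.0)
print("accuracy before:", np.mean(z0 == z), "after:", np.mean(z1 == z))
```

The update uses the previous labels for every node (a Jacobi-style pass), so it can be iterated until the labels stop changing. A fuller procedure would also re-estimate `P`, `mu` and `sigma` from the current partition at each iteration, which this sketch omits.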

Paper Structure (48 sections, 8 theorems, 91 equations, 3 figures, 1 algorithm)

This paper contains 48 sections, 8 theorems, 91 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1

Assume that $\Delta_{min}^2 \asymp \log n$. Under assumptions ass:balanced_part, ass:iso, ass:sym and ass:lim_inf, if $z^{(0)}$ is such that for a constant $\epsilon$ small enough, then with probability at least $1-n^{-\Omega(1)}$ we have for all $t\gtrsim \log n$ where $c>0$ is the constant appearing in Lemma lem:conc_oracle.

Figures (3)

  • Figure 1: Average performance over $20$ runs under Scenario 1.
  • Figure 2: Average performance over $20$ runs under Scenario 2.
  • Figure 3: Average performance over $20$ runs with varying $K$ (Scenario 3).

Theorems & Definitions (19)

  • Theorem 1
  • Remark 1
  • Remark 2
  • proof : Sketch of the proof of Theorem 1
  • Theorem 2
  • Remark 3
  • proof : Sketch of the proof
  • Lemma 1
  • proof
  • Corollary 1
  • ...and 9 more