Refined Graph Encoder Embedding via Self-Training and Latent Community Recovery

Cencheng Shen; Jonathan Larson; Ha Trinh; Carey E. Priebe

TL;DR

This work improves vertex embeddings beyond standard spectral and graph encoder embedding (GEE) methods by refining GEE through a linear transformation and iterative latent community recovery. The proposed Refined Graph Encoder Embedding (R-GEE) applies linear discriminant analysis (LDA) to produce a self-trained embedding, then iteratively uncovers latent communities and outputs the concatenated refined embeddings together with updated labels. Theoretically, the GEE embedding is shown to be asymptotically normal under the stochastic block model (SBM), and the LDA transformation is shown to estimate the conditional distribution P(Y|X); these results indicate when refinement should be applied, and latent refinement improves margin separation when it is beneficial. Empirically, simulations and real-graph experiments show improved vertex classification and meaningful latent community recovery while keeping the running time linear in graph size, offering a theoretically grounded alternative to more opaque deep learning approaches.
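As a rough illustration of the embedding that R-GEE starts from, the sketch below computes an encoder embedding from an edge list in a single pass over the edges. It assumes the usual GEE construction, which this summary does not spell out: $Z = AW$, where $W(j,k) = 1/n_k$ if vertex $j$ has label $k$ and $0$ otherwise. The function name `graph_encoder_embedding` and the toy graph are hypothetical, and the LDA self-training step is not shown.

```python
from collections import Counter


def graph_encoder_embedding(edges, labels, num_classes):
    """Encoder embedding of an undirected, unweighted graph.

    Assumed construction (not stated in this summary): Z = A W with
    W[j][k] = 1 / n_k when labels[j] == k, else 0. One pass over the
    edge list, so the cost is linear in the number of edges.
    """
    n = len(labels)
    class_sizes = Counter(labels)  # n_k for each class k
    Z = [[0.0] * num_classes for _ in range(n)]
    for i, j in edges:
        # Edge (i, j) adds 1 / n_{label(j)} to row i, column label(j).
        Z[i][labels[j]] += 1.0 / class_sizes[labels[j]]
        if i != j:
            # Undirected graph: mirror the contribution for vertex j.
            Z[j][labels[i]] += 1.0 / class_sizes[labels[i]]
    return Z


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = [0, 0, 0, 1, 1, 1]
Z = graph_encoder_embedding(edges, labels, num_classes=2)
for vertex, row in enumerate(Z):
    print(vertex, [round(value, 3) for value in row])
```

Each row of `Z` records the fraction of every class that the vertex is connected to, so vertices 2 and 3, which sit on the bridge between the triangles, are the only ones with mass in both columns.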

Abstract

This paper introduces a refined graph encoder embedding method, enhancing the original graph encoder embedding through linear transformation, self-training, and hidden community recovery within observed communities. We provide the theoretical rationale for the refinement procedure, demonstrating how and why our proposed method can effectively identify useful hidden communities under stochastic block models. Furthermore, we show how the refinement method leads to improved vertex embedding and better decision boundaries for subsequent vertex classification. The efficacy of our approach is validated through numerical experiments, which exhibit clear advantages in identifying meaningful latent communities and improved vertex classification across a collection of simulated and real-world graph data.

Paper Structure (21 sections, 4 theorems, 24 equations, 5 figures, 2 tables, 2 algorithms)

Introduction
Review
Graph Adjacency and Stochastic Block Models
Spectral Embedding and Encoder Embedding
Refined Graph Encoder Embedding
Linear Transformation for Self-Training
Refined GEE via Self-Training and Latent Community Recovery
Running Time Analysis
Theoretical Rationale
Simulations
Model Parameters
Latent Community Recovery
Vertex Classification Evaluation
Real Data Evaluation
Vertex Classification
...and 6 more sections

Key Result

Theorem 1

The graph encoder embedding is asymptotically normally distributed under SBM. Specifically, as $n$ increases, the embedding of the $i$th vertex of class $y$ is asymptotically normal, with expectation $\mu_{y}=\mathbf{B}(y,:)$ and diagonal covariance entries $\Sigma_{y}(k,k)=\mathbf{B}(y,k)(1-\mathbf{B}(y,k))$. Assuming $\Sigma_{y}$ is the same across all $y \in [1,K]$, the transformation in Equation (1) estimates the conditional distribution $P(Y \mid X)$.
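The moments in Theorem 1 can be checked numerically with a small simulation. The sketch below samples a two-block SBM, computes the encoder embedding, and compares the per-class sample mean of $Z(i,k)$ with $\mathbf{B}(y,k)$ and the sample variance multiplied by $n_k$ with $\mathbf{B}(y,k)(1-\mathbf{B}(y,k))$. Several things here are assumptions rather than content from the paper: the embedding is taken to be $Z = AW$ with $W(j,k) = 1/n_k$ for vertices of class $k$, the $n_k$ scaling follows from $Z(i,k)$ being an average of roughly $n_k$ Bernoulli edges, and the block matrix, class sizes, seed, and function names are all hypothetical.

```python
import random


def sample_sbm(block_probs, labels, rng):
    """Sample an undirected SBM edge list without self-loops."""
    n = len(labels)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < block_probs[labels[i]][labels[j]]:
                edges.append((i, j))
    return edges


def encoder_embedding(edges, labels, num_classes):
    """Assumed GEE construction: Z = A W with W[j][k] = 1/n_k if labels[j] == k."""
    sizes = [labels.count(k) for k in range(num_classes)]
    Z = [[0.0] * num_classes for _ in labels]
    for i, j in edges:
        Z[i][labels[j]] += 1.0 / sizes[labels[j]]
        Z[j][labels[i]] += 1.0 / sizes[labels[i]]
    return Z


def mean_and_scaled_variance(Z, labels, y, k):
    """Sample mean of Z(i, k) over class y, and n_k times its sample variance."""
    values = [Z[i][k] for i in range(len(labels)) if labels[i] == y]
    n_k = labels.count(k)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, n_k * variance


rng = random.Random(0)                # fixed seed for reproducibility
B = [[0.5, 0.2], [0.2, 0.4]]          # hypothetical block probability matrix
labels = [0] * 300 + [1] * 300        # hypothetical class sizes
edges = sample_sbm(B, labels, rng)
Z = encoder_embedding(edges, labels, num_classes=2)

for y in range(2):
    for k in range(2):
        mean, scaled_var = mean_and_scaled_variance(Z, labels, y, k)
        print(f"y={y} k={k} mean={mean:.3f} (B={B[y][k]}) "
              f"n_k*var={scaled_var:.3f} (B(1-B)={B[y][k] * (1 - B[y][k]):.3f})")
```

With a few hundred vertices per class the empirical values should land close to the theoretical ones; the agreement tightens as the class sizes grow.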

Figures (5)

Figure 1: This figure shows the running time comparison between GEE, Refined GEE, and SVD. The X-axis represents the approximate number of edges, and the Y-axis represents the running time on a log-10 scale.
Figure 2: This figure visualizes the graph using latent labels (left panel), observed labels (center panel), and GEE-refined labels after one refinement iteration (right panel). In the left panel, the colors dark green, light green, red, and orange represent the four ground-truth latent communities. In the center panel, the dark green and light green vertices from the left panel are combined into a single observed community, colored dark green, while the red and orange vertices are similarly merged into a single observed community, colored red. In the right panel, the R-GEE algorithm refines the observed labels from the center panel and partially recovers the ground-truth latent communities, with the refined communities again represented by light green and orange.
Figure 3: The top row reports the 10-fold cross-validation error and standard deviation for the three simulated graphs over 30 replicates. The bottom row reports the precision and recall of refined GEE in recovering the latent communities.
Figure 4: This figure visualizes two real graphs, the karate club and political blogs, using observed labels (left panel) and GEE-refined labels after one refinement iteration (right panel). In the left panel, the graphs are drawn with vertices colored by observed labels: dark green and red. In the right panel, the same graphs are shown with vertex colors representing refined classes. For the karate club graph, two vertices from the dark green community are refined into a new group, colored light green. For the political blogs graph, some vertices in the dark green group are refined into a new group, also colored light green, while some vertices in the red group are refined into another new group, colored orange.
Figure 5: The left panel displays the precision of the refined GEE in recovering latent communities under four different parameter settings. The right panel shows the vertex classification error.

Theorems & Definitions (6)

Theorem 1
Theorem 2
Theorem 2
proof
Theorem 2
proof
