GraphViz2Vec: A Structure-aware Feature Generation Model to Improve Classification in GNNs
Shraban Kumar Chatterjee, Suman Kundu
TL;DR
GraphViz2Vec addresses the deficiency of traditional GNN initial embeddings in capturing local graph structure and the tendency toward over-smoothing by introducing a structure-aware feature-generation pipeline. It combines random-walk induced subgraph projection, energy-based visualization via Kamada-Kawai layouts, and DenseNet-based image modeling to produce node embeddings that preserve neighborhood structure. The approach, decoupled from the GNN, enables state-of-the-art or near state-of-the-art performance across node and link classification tasks with only two GNN layers, reducing complexity and training requirements. Empirical results across diverse datasets and 12 GNN models demonstrate consistent improvements, with notable gains on several benchmarks and the ability to scale through batching and non-end-to-end feature extraction.
Abstract
GNNs are widely used to solve various tasks, including node classification and link prediction. Most GNN architectures assume the initial embedding to be random or generated from popular distributions. These initial embeddings require multiple layers of transformation to converge into a meaningful latent representation. While a larger number of layers allows a model to aggregate a larger neighbourhood of a node, it also introduces the problem of over-smoothing. In addition, GNNs are inept at representing structural information. For example, the output embedding of a node does not capture its participation in triangles. In this paper, we present a novel feature extraction methodology, GraphViz2Vec, that captures the structural information of a node's local neighbourhood to create meaningful initial embeddings for a GNN model. These initial embeddings help existing models achieve state-of-the-art results in various classification tasks. Further, they help the model produce the desired results with only two layers, which in turn reduces the problem of over-smoothing. The initial encoding of a node is obtained from an image classification model trained on multiple energy diagrams of its local neighbourhood. These energy diagrams are generated from the induced sub-graph of the nodes traversed by multiple random walks. The generated encodings increase the performance of existing models on classification tasks (with a mean increase of $4.65\%$ and $2.58\%$ for the node and link classification tasks, respectively), with some models achieving state-of-the-art results.
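The first two stages described above (random walks that induce a sub-graph, then an energy-based layout of that sub-graph) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the toy graph, the function names, the walk counts and lengths, and the optimiser are all assumptions made for illustration. In particular, the layout minimises the Kamada-Kawai spring energy with plain gradient descent and step halving rather than the per-node Newton-Raphson updates of the original Kamada-Kawai algorithm, and the rendering of the layout to an image and the DenseNet stage are omitted.

```python
import math
import random
from collections import deque

def random_walk_nodes(adj, start, num_walks=4, walk_len=5, seed=0):
    """Collect the nodes visited by several fixed-length random walks from `start`."""
    rng = random.Random(seed)
    visited = {start}
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            neighbours = sorted(adj[node])
            if not neighbours:
                break
            node = rng.choice(neighbours)
            visited.add(node)
    return visited

def induced_subgraph(adj, nodes):
    """Keep every original edge whose two endpoints were both visited."""
    return {u: {v for v in adj[u] if v in nodes} for u in nodes}

def bfs_distances(sub, source):
    """Hop distances from `source` inside the sub-graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in sub[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def kk_energy(pos, dist):
    """Spring energy: sum over node pairs of (|p_i - p_j| - d_ij)^2 / (2 d_ij^2)."""
    nodes = sorted(pos)
    total = 0.0
    for a, i in enumerate(nodes):
        for j in nodes[a + 1:]:
            d = dist[i][j]
            gap = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            total += 0.5 * (gap - d) ** 2 / d ** 2
    return total

def kk_layout(sub, iters=200, step=0.1):
    """Lower the spring energy by gradient descent, halving the step on failure."""
    nodes = sorted(sub)
    n = len(nodes)
    dist = {u: bfs_distances(sub, u) for u in nodes}
    # Assumed initialisation: nodes evenly spaced on the unit circle.
    pos = {u: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
           for k, u in enumerate(nodes)}
    energies = [kk_energy(pos, dist)]
    for _ in range(iters):
        grad = {u: [0.0, 0.0] for u in nodes}
        for a, i in enumerate(nodes):
            for j in nodes[a + 1:]:
                d = dist[i][j]
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                gap = math.hypot(dx, dy) or 1e-9
                coeff = (gap - d) / (d ** 2 * gap)
                grad[i][0] += coeff * dx
                grad[i][1] += coeff * dy
                grad[j][0] -= coeff * dx
                grad[j][1] -= coeff * dy
        trial = {u: (pos[u][0] - step * grad[u][0], pos[u][1] - step * grad[u][1])
                 for u in nodes}
        trial_energy = kk_energy(trial, dist)
        if trial_energy < energies[-1]:
            pos = trial
            energies.append(trial_energy)
        else:
            step *= 0.5
    return pos, energies

# Hypothetical toy graph: a triangle (0, 1, 2) with a short tail (3, 4, 5).
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3, 5}, 5: {4}}
nodes = random_walk_nodes(adj, start=0)
sub = induced_subgraph(adj, nodes)
pos, energies = kk_layout(sub)
```

Because every walk starts at the same node, the induced sub-graph is connected, so all pairwise hop distances used by the energy are finite. Running the walks with different seeds gives several sub-graphs per node, which corresponds to the "multiple energy diagrams" the abstract mentions; each layout in `pos` would then be rendered as an image before being passed to the image model.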
