Training A Foundation Model to Represent Graphs as Vectors

Qi Feng; Jicong Fan

TL;DR

This work addresses cross-domain graph-level representation learning by training a graph foundation model, GraphVec-FM, that maps any graph to a fixed-dimensional vector while preserving topology and semantics. It introduces three components: a multi-graph feature alignment strategy that derives consistent node embeddings across domains, a density-maximization mean alignment algorithm with a convergence guarantee, and a multi-layer reference-distribution readout that retains node-embedding information in the graph representation. The model is trained with supervised (and optionally unsupervised) contrastive objectives, and a theoretical generalization bound supports its cross-domain applicability. Empirically, GraphVec-FM outperforms strong baselines on few-shot graph classification and graph clustering, and it scales to large datasets via Nyström approximations and batching. Cross-modal alignment is left as a potential extension for future work.
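As a rough illustration of the Nyström idea mentioned in the summary, the sketch below builds low-rank features `Z` such that `Z @ Z.T` approximates a full kernel matrix using only a subset of landmark points. This page does not say which matrix the paper approximates, so the RBF kernel, the uniform landmark sampling, and the names `rbf_kernel` and `nystrom_features` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, num_landmarks, gamma=1.0, seed=0):
    """Return Z (n x m) with Z @ Z.T approximating the n x n RBF kernel.

    Hypothetical helper: the kernel choice and uniform landmark sampling
    are assumptions, not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=num_landmarks, replace=False)
    landmarks = X[idx]
    C = rbf_kernel(X, landmarks, gamma)          # n x m cross-kernel
    W = rbf_kernel(landmarks, landmarks, gamma)  # m x m landmark kernel
    # W^{-1/2} through an eigendecomposition; tiny eigenvalues are clipped.
    vals, vecs = np.linalg.eigh(W)
    vals = np.maximum(vals, 1e-10)
    W_inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T
    return C @ W_inv_sqrt
```

The cost is dominated by the `n x m` cross-kernel rather than the full `n x n` kernel, which is why the approximation helps when the number of points is large and `m` is kept small.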

Abstract

This paper aims to train a graph foundation model that is able to represent any graph as a vector preserving structural and semantic information useful for downstream graph-level tasks such as graph classification and graph clustering. To learn the features of graphs from diverse domains while maintaining strong generalization ability to new domains, we propose a multi-graph-based feature alignment method, which constructs weighted graphs using the attributes of all nodes in each dataset and then generates consistent node embeddings. To enhance the consistency of the features from different datasets, we propose a density maximization mean alignment algorithm with guaranteed convergence. The original graphs and generated node embeddings are fed into a graph neural network to achieve discriminative graph representations in contrastive learning. More importantly, to enhance the information preservation from node-level representations to the graph-level representation, we construct a multi-layer reference distribution module without using any pooling operation. We also provide a theoretical generalization bound to support the effectiveness of the proposed model. The experimental results of few-shot graph classification and graph clustering show that our model outperforms strong baselines.
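To make the contrastive training step more concrete, here is a minimal sketch of one widely used form of supervised contrastive loss applied to a batch of graph-level vectors. The abstract does not give the exact objective, so the cosine similarity, the temperature value, and the function name `supervised_contrastive_loss` are assumptions; the paper's loss may differ.

```python
import numpy as np

def supervised_contrastive_loss(G, labels, temperature=0.5):
    """Supervised contrastive loss over graph vectors G (n x d).

    Illustrative only: graphs with the same label are treated as positives.
    The batch should contain at least one pair of graphs sharing a label.
    """
    G = G / np.linalg.norm(G, axis=1, keepdims=True)  # unit-normalise rows
    sim = G @ G.T / temperature                       # scaled cosine similarity
    n = G.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)           # exclude self-pairs
    # Row-wise log-softmax over all other samples in the batch.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0                             # anchors with a positive
    sum_pos = np.where(pos, log_prob, 0.0).sum(axis=1)
    return -(sum_pos[valid] / pos_count[valid]).mean()
```

The loss is small when graphs with the same label have nearby vectors and large when same-label graphs are far apart, which is the behaviour a discriminative graph representation needs.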

Paper Structure (46 sections, 11 theorems, 62 equations, 7 figures, 17 tables, 2 algorithms)

Introduction
Related Work
Language Model-Free GFMs
Problem Formulation
Methodology
Multi-Graph based Feature Alignment
Density Maximization based Mean Alignment
GIN and Graph Transformer based Model
Reference Distribution based Global Graph Representation
Model Pre-training
Model Testing
Generalization Error Bound
Experiments
Few-Shot Graph Classification
Datasets and Baselines
...and 31 more sections

Key Result

Theorem 4.1

Let $\left\{\mathcal{L}\left(\{\mathbf{R}_j^{(t)}\}_{j=1}^M\right)\right\}_t$ and $\left\{\{\mathbf{R}_j^{(t)}\}_{j=1}^M\right\}_t$ be the sequences generated by the Density Maximization based Mean Alignment algorithm. Then for any $\eta>0$, it holds that: (a) $\left\{\mathcal{L}\left(\{\mathbf{R}_j^{(t)}\}_{j=1}^M\right)\right\}_t$ is non...

Figures (7)

Figure 1: Flow-chart of the proposed method GraphVec-FM. $\mathcal{G}_1,\ldots,\mathcal{G}_M$ are $M$ datasets from different domains. The model represents each graph $G_{i}^{(j)}$ as a single vector $\mathbf{g}_i^{(j)}$ that can be used in graph-level downstream tasks.
Figure 2: Change in classification accuracy on ENZYMES as the number of datasets used in pre-training increases from $1$ to $4$.
Figure 3: Classification accuracy of GraphVec-FM for varying $k$ values in few-shot learning on four datasets (PROTEINS, NCI109, DD, and Mutagenicity). The shaded area represents the standard deviation.
Figure 4: t-SNE visualization of aligned node embeddings of datasets from different domains.
Figure 5: Few-shot graph classification accuracy on datasets with node attributes as the number of global multi-graphs increases from 1 to 6.
...and 2 more figures

Theorems & Definitions (18)

Theorem 4.1
Theorem 4.2
proof
Lemma E.1: McDiarmid's inequality (McDiarmid, 1989)
Lemma E.2
proof
Lemma E.3
Lemma E.4
proof
Lemma E.5
...and 8 more
