Billion-Scale Graph Foundation Models

Maya Bechler-Speicher; Yoel Gottlieb; Andrey Isakov; David Abensur; Ami Tavory; Daniel Haimovich; Ido Guy; Udi Weinsberg

TL;DR

This work introduces GraphBFF, the first end-to-end recipe for billion-parameter Graph Foundation Models (GFMs) that operate on billion-scale heterogeneous graphs. At its core is the GraphBFF Transformer, which fuses Type-Conditioned Attention and Type-Agnostic Attention with a sparse softmax to achieve scalable expressivity for complex graphs. The authors establish neural scaling laws for GFMs, showing predictable loss reductions when jointly scaling model size $N$ and data size $D$, and they demonstrate a 1.4B-parameter GraphBFF pretrained on one billion edges achieving strong zero-shot, few-shot, and probing performance across ten downstream tasks unseen during training. Practical contributions include novel batching strategies (KL-Batching and Round-Robin Batching) and fine-tuning methods, along with a rigorous analysis of when and how GFMs improve over task-specific baselines. The results highlight coupled data–model bottlenecks and offer a principled blueprint for deploying industrial-scale GFMs on heterogeneous graphs, while outlining key open questions and opportunities for future work.

Abstract

Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion-Foundation-Fusion (GraphBFF): the first end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for arbitrary heterogeneous, billion-scale graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using the GraphBFF, we present the first neural scaling laws for general graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework with an evaluation of a 1.4 billion-parameter GraphBFF Transformer pretrained on one billion samples. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF achieves remarkable zero-shot and probing performance, including in few-shot settings, with large margins of up to 31 PRAUC points. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learning at industrial scale.
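To make the "bottleneck" reading of the scaling-law claim concrete, the sketch below uses an additive power-law form, $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$, which is common in the language-model scaling literature. This summary does not state the paper's actual functional form, so the form itself, the function names `scaling_loss` and `fit_power_law`, and every constant here are illustrative assumptions, not values from the paper.

```python
import numpy as np


def scaling_loss(n_params, n_samples, E=1.0, A=50.0, alpha=0.3, B=200.0, beta=0.3):
    """Hypothetical additive power-law loss L(N, D).

    E is an irreducible loss floor; the A-term shrinks with model size N and
    the B-term shrinks with data size D. All constants are made up.
    """
    return E + A / n_params**alpha + B / n_samples**beta


def fit_power_law(x, excess_loss):
    """Fit excess_loss ~ c * x**(-k) by least squares in log-log space.

    Returns (k, c). Assumes the irreducible floor has already been subtracted.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(excess_loss), 1)
    return -slope, np.exp(intercept)


if __name__ == "__main__":
    # Recover the exponent from synthetic, noise-free model-size measurements.
    sizes = np.array([1e6, 1e7, 1e8, 1e9])
    k, c = fit_power_law(sizes, 50.0 * sizes**-0.3)
    print(f"fitted exponent k={k:.3f}, coefficient c={c:.1f}")

    # Data bottleneck: with D fixed, growing N cannot push the loss below
    # the floor E + B / D**beta set by the data term.
    for n in (1e6, 1e9, 1e12):
        print(f"N={n:.0e}, D=1e6 -> loss={scaling_loss(n, 1e6):.4f}")
```

Under this assumed form, once the model term is small relative to the data term, further increases in $N$ leave the loss almost unchanged, and only more data lowers it. The symmetric statement holds when the model term dominates.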

Paper Structure

This paper contains 18 sections, 1 theorem, 41 equations, 5 figures, and 3 tables.

Introduction
Related Work
Preliminaries
The GraphBFF Transformer
Pre-Training
Fine-Tuning and Extending $\mathcal{G}$
Probing, Few-Shot and Zero-Shot
Setting and data
Results
Neural Scaling Laws
Setup
Results
Discussion and Future Work
Defining the pre-training universe $\mathcal{G}$
Compute-Optimal Allocation
...and 3 more sections

Key Result

Theorem 4.1

Consider a GraphBFF Transformer layer with hidden dimension $d_\ell=d$ and number of heads $H$, and heterogeneous attention sub-blocks TCA and TAA. Let $\mathcal{F}_{\mathrm{GraphBFF}}, \mathcal{F}_{\mathrm{TAA}}, \mathcal{F}_{\mathrm{TCA}}$ be the sets of functions realizable by the GraphBFF Transfo…

Figures (5)

Figure 1: An illustration of FMs as GFMs over specific topological and feature distributions, using the minimal graph structure they operate on. Node colors correspond to token types, with each type undergoing distinct transformations within the model. The GraphBFF is designed for general graphs and feature distributions, supporting any number of token types.
Figure 2: The GraphBFF Transformer block.
Figure 3: Performance across Tasks 1–10 under different data settings: 1/2/5/10-shot and full-data. Lines show GFM-1/2/3. Markers at the full-data point indicate task-specific baselines, with context size encoded by marker shape.
Figure 4: Zero-shot separation for datasets 1–6. The raw features and the GFM embeddings for context sizes 1, 2, and 3 are projected to 2D space using PCA, and the points are colored according to their label.
Figure 5: The early-stop validation loss as a function of model size and data size. The dashed lines are obtained by fitting the exponents in the scaling-law equation. Loss decreases predictably depending on the model and data size, across five orders of magnitude.

Theorems & Definitions (2)

Theorem 4.1
Proof of Theorem 4.1
