Graph Mamba: Towards Learning on Graphs with State Space Models

Ali Behrouz; Farnoosh Hashemi

TL;DR

Graph Mamba Networks (GMNs) address the quadratic cost of graph transformers and the limitations of message passing with a framework built on selective State Space Models. The design follows a recipe of four required steps (Neighborhood Tokenization, Token Ordering, Local Encoding, and a Bidirectional Selective SSM Encoder) plus an optional PE/SE step, with per-node complexity $O(M\,s\,(m+1))$ and total cost $O(M\,s\,(m+1)\,|V| + |E|)$. The authors establish universality results and report strong empirical performance on long-range, large-scale, and heterophilic graphs while using less memory than competitive baselines. The work shows that, with careful tokenization and selective SSMs, high performance is achievable without relying exclusively on attention-based Transformers or heavy positional/structural encodings.
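To give a feel for the scale of these expressions, the short sketch below evaluates them with every hidden constant set to 1. The function name and the parameter values are hypothetical, chosen only for illustration; asymptotic bounds do not predict actual runtimes.

```python
def gmn_token_cost(M: int, s: int, m: int, num_nodes: int, num_edges: int) -> tuple[int, int]:
    """Evaluate M*s*(m+1) per node and M*s*(m+1)*|V| + |E| in total.

    Assumption: all constants hidden by the O(.) notation are set to 1,
    so the returned numbers are unitless counts, not runtimes.
    """
    per_node = M * s * (m + 1)
    total = per_node * num_nodes + num_edges
    return per_node, total


# Hypothetical setting: M=4, s=2, m=3 on a graph with 1000 nodes and 5000 edges.
per_node, total = gmn_token_cost(M=4, s=2, m=3, num_nodes=1000, num_edges=5000)
print(per_node, total)   # 32 37000
print(1000 ** 2)         # 1000000, the |V|^2 term of dense all-pairs attention
```

For fixed $M$, $s$, and $m$, the total grows linearly in $|V|$ and $|E|$, which is the contrast with the quadratic cost of full attention that the summary draws.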

Abstract

Graph Neural Networks (GNNs) have shown promising potential in graph representation learning. The majority of GNNs define a local message-passing mechanism, propagating information over the graph by stacking multiple layers. These methods, however, are known to suffer from two major limitations: over-squashing and poor capturing of long-range dependencies. Recently, Graph Transformers (GTs) emerged as a powerful alternative to Message-Passing Neural Networks (MPNNs). GTs, however, have quadratic computational cost, lack inductive biases on graph structures, and rely on complex Positional/Structural Encodings (PE/SE). In this paper, we show that while Transformers, complex message-passing, and PE/SE are sufficient for good performance in practice, none of them is necessary. Motivated by the recent success of State Space Models (SSMs), such as Mamba, we present Graph Mamba Networks (GMNs), a general framework for a new class of GNNs based on selective SSMs. We discuss and categorize the new challenges when adapting SSMs to graph-structured data, and present four required steps and one optional step to design GMNs, where we choose (1) Neighborhood Tokenization, (2) Token Ordering, (3) Architecture of Bidirectional Selective SSM Encoder, (4) Local Encoding, and dispensable (5) PE and SE. We further provide theoretical justification for the power of GMNs. Experiments demonstrate that despite much less computational cost, GMNs attain outstanding performance in long-range, small-scale, large-scale, and heterophilic benchmark datasets.

Paper Structure (29 sections, 8 theorems, 10 equations, 6 figures, 7 tables, 2 algorithms)

Introduction
Related Work and Backgrounds
Message-Passing Neural Networks
Graph Transformers
State Space Models
Challenges & Motivations: Transformers vs Mamba
Graph Mamba Networks
Tokenization and Encoding
Bidirectional Mamba
Tokenization When $m = 0$
Theoretical Analysis of GMNs
Experiments
Experimental Setup
Long Range Graph Benchmark
Comparison on GNN Benchmark
...and 14 more sections

Key Result

Theorem 4.1

With large enough $M$, $m$, and $s > 0$, GMNs' neighborhood sampling is strictly more expressive than $k$-hop neighborhood sampling.
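To make the roles of $M$, $m$, and $s$ concrete, here is a minimal sketch of one way such neighborhood sampling could work. It assumes that tokens are built from random walks: for each walk length from $0$ to $m$, $M$ walks start at the target node and the visited nodes form one token, and the whole procedure is repeated $s$ times. This summary does not spell out the sampling procedure, so the random-walk construction, the function name `sample_tokens`, and the toy graph are assumptions, not the authors' implementation.

```python
import random


def sample_tokens(adj, v, M, m, s, rng):
    """Return s * (m + 1) tokens for node v; each token is a frozenset of node ids.

    Assumption: a token is the set of nodes visited by M random walks of a
    fixed length that all start at v (walk lengths range over 0..m).
    """
    tokens = []
    for _ in range(s):                      # s independent repetitions
        for length in range(m + 1):         # one token per walk length 0..m
            nodes = {v}
            for _ in range(M):              # M walks contribute to each token
                cur = v
                for _ in range(length):
                    nbrs = adj[cur]
                    if not nbrs:            # dead end: stop this walk early
                        break
                    cur = rng.choice(nbrs)
                    nodes.add(cur)
            tokens.append(frozenset(nodes))
    return tokens


# Toy path graph 0 - 1 - 2 - 3 (hypothetical example data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
tokens = sample_tokens(adj, v=0, M=2, m=2, s=3, rng=random.Random(0))
print(len(tokens))   # 9, i.e. s * (m + 1)
```

Under this assumption each node receives $s\,(m+1)$ tokens, each built from $M$ walks, which is consistent with the $M\,s\,(m+1)$ per-node factor quoted in the TL;DR. With $m = 0$ every token reduces to the node itself, matching the node-tokenization case.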

Figures (6)

Figure 1: Schematic of GMNs with four required steps and one optional step: (1) Tokenization: the graph is mapped into a sequence of tokens ($m \geq 1$: subgraph and $m = 0$: node tokenization). (2) (Optional Step) PE/SE: inductive bias is added to the architecture using information about the position of nodes and the structure of the graph. (3) Local Encoding: local structures around each node are encoded using a subgraph vectorization mechanism. (4) Token Ordering: the sequence of tokens is ordered based on the context. (Subgraph tokenization ($m \geq 1$) has an implicit order and does not need this step.) (5) (Stack of) Bidirectional Mamba: it scans and selects relevant nodes or subgraphs to flow into the hidden states. $^\dagger$ In this figure, the last layer of bidirectional Mamba, which acts as a readout on all nodes, is omitted for simplicity.
Figure 2: Efficiency evaluation and accuracy of GMNs and baselines on OGBN-Arxiv and MalNet-Tiny. Highlighted are the first, second, and third best results. OOM: Out of Memory.
Figure 3: Memory usage of GPS and GMN on the MalNet-Tiny dataset.
Figure 4: The effect of (Left) $M$, (Middle) $m$, and (Right) $s$ on the performance of GMNs.
Figure 5: Failure example for methods that are solely based on distance encoding. Considering only the sets of nodes at different distances from the target node misses the connections between them. While the structures of these two graphs are different, the sets of nodes at the same distance from node $A$ are the same. Accordingly, GRED [ding2023recurrent] and S4G [s4g] produce the same node encoding for $A$, missing $A$'s neighborhood topology.
...and 1 more figure
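The failure mode described in the Figure 5 caption can be reproduced on a toy pair of graphs. The sketch below groups nodes by shortest-path distance from $A$ using breadth-first search and shows that two structurally different graphs yield identical distance sets. The two graphs are hypothetical examples, not the ones drawn in the figure, and the code illustrates only the distance-set information, not how GRED or S4G are implemented.

```python
from collections import deque


def distance_layers(adj, source):
    """Group nodes by shortest-path distance from `source` using BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    layers = {}
    for node, d in dist.items():
        layers.setdefault(d, set()).add(node)
    return layers


# Hypothetical graphs: g1 is a path B - A - C; g2 adds the edge B - C (a triangle).
g1 = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
g2 = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

print(distance_layers(g1, "A") == distance_layers(g2, "A"))   # True
print(g1 == g2)                                               # False
```

Any encoder that sees only these per-distance node sets cannot tell the two neighborhoods of $A$ apart, because the edge between $B$ and $C$ never changes a distance from $A$.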

Theorems & Definitions (12)

Theorem 4.1
Theorem 4.2: Universality
Theorem 4.3: Expressive Power w/ PE
Theorem 4.4: Expressive Power w/o PE and MPNN
Theorem 7.1
proof
Theorem 7.2: Universality
proof
Theorem 7.3: Expressive Power w/ PE
proof
...and 2 more
