Graph Perceiver IO: A General Architecture for Graph Structured Data

Seyun Bae, Hoyoon Byun, Changdae Oh, Yoon-Sik Cho, Kyungwoo Song

TL;DR

Graph Perceiver IO (GPIO) generalizes the Perceiver IO architecture to graph-structured data by introducing graph-specific input/output arrays that encode both node features and topology through Random Walk Positional Encoding and smoothing-based outputs. GPIO achieves lower space complexity than traditional GNNs, enables simultaneous multimodal processing (GPIO+), and supports diverse tasks—graph classification, node classification, link prediction, and multimodal text/image classification—within a single unified model. Empirical results show GPIO frequently matching or surpassing GNN baselines across multiple benchmarks, and GPIO+ further improves multimodal few-shot performance by using a relational decoder to capture inter-set relationships. The approach offers practical scalability to large, dense graphs and suggests a general pathway for integrating graph structure into modality-agnostic architectures for broad real-world applications.

Abstract

Multimodal machine learning has been widely studied for the development of general intelligence. Recently, the Perceiver and Perceiver IO have shown competitive results across diverse dataset domains and tasks. However, these works have focused on heterogeneous modalities such as image and text, and few studies address graph-structured datasets. Unlike text and images, a graph comes with an adjacency matrix, and handling this topological information is not trivial. In this study, we propose Graph Perceiver IO (GPIO), a Perceiver IO for graph-structured datasets. GPIO keeps the main structure of the Perceiver IO, because the Perceiver IO already handles diverse datasets well, with the exception of graph-structured data. GPIO is a general method that handles diverse datasets, such as graph-structured data, text, and images, by leveraging positional encoding and output query smoothing. Compared to graph neural networks (GNNs), GPIO requires lower complexity and can efficiently incorporate global and local information, which is also empirically validated through experiments. Furthermore, we propose GPIO+ for multimodal few-shot classification, which incorporates both images and graphs simultaneously. GPIO achieves higher benchmark accuracy than GNNs across multiple tasks, including graph classification, node classification, and multimodal text classification, while also attaining superior AP and AUC in link prediction. Additionally, GPIO+ outperforms GNNs in multimodal few-shot classification. Our GPIO(+) can serve as a general architecture for handling various modalities and tasks.
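
The positional encoding mentioned above can be illustrated with a small sketch. It assumes the common definition of Random Walk Positional Encoding (RWPE), in which each node is described by the probability that a k-step random walk returns to it, for k = 1 to t, and it assumes an unweighted 0/1 adjacency matrix. The function names `rwpe` and `build_input_array` and the toy graph are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import numpy as np


def rwpe(adj: np.ndarray, t: int) -> np.ndarray:
    """Return an (N, t) array; column k-1 holds each node's k-step return probability."""
    adj = adj.astype(float)
    deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize into a random-walk transition matrix (assumes 0/1 edges).
    # Isolated nodes have all-zero rows, so dividing by 1 keeps them at zero.
    trans = adj / np.maximum(deg, 1.0)
    power = np.eye(adj.shape[0])
    columns = []
    for _ in range(t):
        power = power @ trans            # k-step transition probabilities
        columns.append(np.diag(power))   # probability of being back at the start node
    return np.stack(columns, axis=1)


def build_input_array(features: np.ndarray, adj: np.ndarray, t: int) -> np.ndarray:
    """Concatenate node features (N, C) with RWPE (N, t) into an (N, C + t) array."""
    return np.concatenate([features, rwpe(adj, t)], axis=1)


# Toy path graph 0 - 1 - 2 with two hypothetical features per node.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
features = np.arange(6, dtype=float).reshape(3, 2)
input_array = build_input_array(features, adj, t=3)
print(input_array.shape)  # (3, 5)
```

On this path graph the middle node returns to itself after two steps with probability 1 and each end node with probability 0.5, so the encoding separates nodes by their structural role even when their features are uninformative.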

Paper Structure (44 sections, 11 equations, 10 figures, 16 tables)

This paper contains 44 sections, 11 equations, 10 figures, 16 tables.

Figures (10)

  • Figure 1: A graph-structured dataset splits into a node feature matrix and an adjacency matrix. GPIO constructs the input array and the output query array to integrate graph data. Note that the node feature matrix can hold text or image features in multimodal learning. (i) Input array: the concatenation of the node features and the random walk positional embedding ($RWPE$), where $t$ denotes the dimension of $RWPE$. (ii) Output query array: a randomly initialized $D_q$-dimensional vector for the graph classification task, and an ($M$, $D_q$) array for node classification, link prediction, and multimodal text classification. The output query array can be learnable or fixed, and c.l.f denotes the classifier.
  • Figure 2: Overall structure of GPIO. First, cross-attention between the initial latent and the input query lets the latent absorb the necessary information from the input. Then, the latent progressively encodes the salient features of a given data point through repeated self-attention blocks. Finally, the output query array and the final latent communicate via cross-attention to produce the proper output for each task. $D_{q}$ is the arbitrarily configurable dimension of the output query array, and $E$ is the number of classes for a given task. The depth (number of layers) of the self-attention block is a controllable hyperparameter.
  • Figure 3: Overall structure of GPIO+. The red dashed area is the relational decoder introduced for the extension to GPIO+. $\widehat{A}_{1}$ gives rich information to classify $\mathcal{Q}$. The second decoder performs multimodal few-shot image classification.
  • Figure 4: t-SNE [van2008visualizing] visualization of the node embeddings learned by GPIO (left), APPNP (middle), and GAT (right) on the PubMed dataset. The learned embedding of GPIO shows relatively large uniformity compared to APPNP and GAT. Large uniformity indicates that a feature distribution utilizes maximal information, and it correlates positively with downstream task performance [wang2020understanding].
  • Figure 5: Pairwise shortest-path heatmap on the PubMed [pubmed] dataset. The left side shows the heatmap before 2-hop score propagation, and the right side shows it after.
  • ...and 5 more figures
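
The encode, process, decode flow described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' model: it uses single-head attention with random weights, omits layer normalization, MLP blocks, and output query smoothing, and treats $M$ as the number of nodes for node classification. All names (`attention`, `init_params`, `gpio_forward`) and all sizes are hypothetical.

```python
import numpy as np


def attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention of `queries` over `context`."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def init_params(rng, in_dim, latent_dim, query_dim, num_classes, depth):
    """Random projection matrices standing in for learned weights."""
    def proj(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))

    return {
        "encode": (proj(latent_dim, latent_dim), proj(in_dim, latent_dim),
                   proj(in_dim, latent_dim)),
        "process": [(proj(latent_dim, latent_dim), proj(latent_dim, latent_dim),
                     proj(latent_dim, latent_dim)) for _ in range(depth)],
        "decode": (proj(query_dim, latent_dim), proj(latent_dim, latent_dim),
                   proj(latent_dim, query_dim)),
        "classifier": proj(query_dim, num_classes),
    }


def gpio_forward(input_array, latent, output_query, params):
    # Encode: the latent attends to the input array, (L, D) over (N, C + t).
    latent = latent + attention(latent, input_array, *params["encode"])
    # Process: repeated self-attention over the L latents only.
    for block in params["process"]:
        latent = latent + attention(latent, latent, *block)
    # Decode: each output query attends to the final latent, giving (M, D_q).
    decoded = attention(output_query, latent, *params["decode"])
    # Classifier ("c.l.f" in Figure 1): map D_q to E class logits.
    return decoded @ params["classifier"]


rng = np.random.default_rng(0)
num_nodes, in_dim = 5, 7          # N nodes, each with C + t input channels
num_latents, latent_dim = 4, 8    # L latents of width D
query_dim, num_classes = 6, 3     # D_q and E

params = init_params(rng, in_dim, latent_dim, query_dim, num_classes, depth=2)
input_array = rng.normal(size=(num_nodes, in_dim))
latent = rng.normal(size=(num_latents, latent_dim))

# Node-level tasks use one query per node; graph classification uses a single query.
node_logits = gpio_forward(input_array, latent, rng.normal(size=(num_nodes, query_dim)), params)
graph_logits = gpio_forward(input_array, latent, rng.normal(size=(1, query_dim)), params)
print(node_logits.shape, graph_logits.shape)  # (5, 3) (1, 3)
```

In this sketch the repeated self-attention runs over the L latent rows rather than the N input rows, so only the two cross-attention steps scale with the number of nodes. Changing the number of output query rows is the only difference between the node-level and graph-level calls.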