Graph Perceiver IO: A General Architecture for Graph Structured Data
Seyun Bae, Hoyoon Byun, Changdae Oh, Yoon-Sik Cho, Kyungwoo Song
TL;DR
Graph Perceiver IO (GPIO) generalizes the Perceiver IO architecture to graph-structured data by introducing graph-specific input/output arrays that encode both node features and topology through Random Walk Positional Encoding and smoothing-based outputs. GPIO achieves lower space complexity than traditional GNNs, enables simultaneous multimodal processing (GPIO+), and supports diverse tasks—graph classification, node classification, link prediction, and multimodal text/image classification—within a single unified model. Empirical results show GPIO frequently matching or surpassing GNN baselines across multiple benchmarks, and GPIO+ further improves multimodal few-shot performance by using a relational decoder to capture inter-set relationships. The approach offers practical scalability to large, dense graphs and suggests a general pathway for integrating graph structure into modality-agnostic architectures for broad real-world applications.
Abstract
Multimodal machine learning has been widely studied as a route toward general intelligence. Recently, the Perceiver and Perceiver IO have shown competitive results across diverse dataset domains and tasks. However, these works have focused on heterogeneous modalities such as images and text, and little research has addressed graph-structured datasets. Unlike text and images, a graph has an adjacency matrix, and handling this topological information is not trivial. In this study, we propose Graph Perceiver IO (GPIO), a Perceiver IO for graph-structured datasets. We keep the main structure of GPIO the same as that of Perceiver IO, because Perceiver IO already handles diverse datasets well, with the exception of graph-structured data. GPIO is a general method that handles diverse datasets, such as graph-structured data, text, and images, by leveraging positional encoding and output query smoothing. Compared to graph neural networks (GNNs), GPIO has lower complexity and can efficiently incorporate global and local information, which we also validate empirically through experiments. Furthermore, we propose GPIO+ for multimodal few-shot classification, which incorporates images and graphs simultaneously. GPIO achieves higher benchmark accuracy than GNNs across multiple tasks, including graph classification, node classification, and multimodal text classification, while also attaining superior AP and AUC in link prediction. Additionally, GPIO+ outperforms GNNs in multimodal few-shot classification. GPIO(+) can serve as a general architecture for handling various modalities and tasks.
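
To make the positional-encoding idea concrete, the following is a minimal sketch of a random walk positional encoding in its commonly used form: for each node, the probabilities of returning to that node after 1, ..., k random-walk steps. It is an illustration under stated assumptions, not the authors' implementation. The function names `random_walk_pe` and `build_input_array`, the parameter `k`, and the choice to concatenate the encoding with the node features are assumptions made for this example.

```python
import numpy as np


def random_walk_pe(adj, k):
    """Return an (n, k) array whose column s holds the probability that a
    random walk started at each node is back at that node after s+1 steps."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    # Inverse degree, with isolated nodes (degree 0) mapped to 0.
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    rw = adj * inv_deg[:, None]  # row-normalized transition matrix D^{-1} A
    n = adj.shape[0]
    pe = np.empty((n, k))
    power = np.eye(n)
    for step in range(k):
        power = power @ rw              # (D^{-1} A)^(step + 1)
        pe[:, step] = np.diag(power)    # return probabilities
    return pe


def build_input_array(node_features, adj, k):
    """Hypothetical helper (assumption): append the positional encoding to
    the node features so topology enters the model through the input."""
    return np.concatenate([node_features, random_walk_pe(adj, k)], axis=1)


# Path graph 0 - 1 - 2 with 4-dimensional node features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
features = np.ones((3, 4))
pe = random_walk_pe(adj, k=2)
inputs = build_input_array(features, adj, k=2)
```

On this three-node path graph, no walk can return in one step, so the first column is zero, while the middle node returns with certainty after two steps and each end node returns with probability 0.5.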
