UniGraph2: Learning a Unified Embedding Space to Bind Multimodal Graphs

Yufei He, Yuan Sui, Xiaoxin He, Yue Liu, Yifei Sun, Bryan Hooi

TL;DR

UniGraph2 addresses a limitation of existing foundation models, which neglect graph structure in multimodal data, by learning a unified embedding space for Multimodal Graphs (MMGs). It combines modality-specific encoders, a Mixture of Experts (MoE) alignment module, and a Graph Neural Network (GNN) to fuse multimodal features with graph structure. Training relies on cross-domain multi-graph pre-training with a dual objective that combines feature reconstruction and shortest-path distance (SPD) reconstruction. The framework employs multimodal masking, domain-specific decoders, and a shared SPD decoder, and it scales through Personalized PageRank (PPR) sampling for subgraph extraction. Experiments on self-supervised representation learning, few-shot transfer, and multimodal generation show that UniGraph2 consistently outperforms state-of-the-art baselines, especially in cross-domain and cross-graph settings, while remaining competitively efficient at inference. The result is a foundation model that adapts to diverse graphs and modalities without task-specific retraining, with implications for web-scale multimodal reasoning and recommendation systems.
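The PPR-based subgraph extraction mentioned above can be illustrated with a small sketch. It approximates Personalized PageRank by power iteration and keeps the `k` highest-scoring nodes around a seed. The function names `personalized_pagerank` and `ppr_subgraph`, the restart probability `alpha = 0.15`, the iteration count, and the top-`k` selection rule are all assumptions made for illustration; the summary does not specify how the paper's sampler is implemented.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency list; every node appears as a key


def personalized_pagerank(graph: Graph, seed: int, alpha: float = 0.15,
                          iters: int = 50) -> Dict[int, float]:
    """Approximate PPR scores for `seed` by power iteration.

    alpha is the restart probability (an assumed value, not from the paper).
    """
    scores = {v: 0.0 for v in graph}
    scores[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for v, mass in scores.items():
            neighbors = graph[v]
            if not neighbors:
                # Dangling node: send its walking mass back to the seed.
                nxt[seed] += (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(neighbors)
            for u in neighbors:
                nxt[u] += share
        nxt[seed] += alpha  # total mass is 1, so the restart mass is alpha
        scores = nxt
    return scores


def ppr_subgraph(graph: Graph, seed: int, k: int) -> List[int]:
    """Return the k nodes with the highest PPR score with respect to `seed`."""
    scores = personalized_pagerank(graph, seed)
    ranked = sorted(graph, key=lambda v: (-scores[v], v))
    return ranked[:k]


# Two disconnected components: a triangle {0, 1, 2} and an edge {3, 4}.
toy_graph: Graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
local_nodes = ppr_subgraph(toy_graph, seed=0, k=3)
```

Because the random walk restarts at the seed, nodes that cannot be reached from it keep a score of zero, so the extracted subgraph stays local to the seed's neighborhood regardless of how large the full graph is.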

Abstract

Existing foundation models, such as CLIP, aim to learn a unified embedding space for multimodal data, enabling a wide range of downstream web-based applications like search, recommendation, and content classification. However, these models often overlook the inherent graph structures in multimodal datasets, where entities and their relationships are crucial. Multimodal graphs (MMGs) are graphs of this kind: each node is associated with features from different modalities, while the edges capture the relationships between these entities. Existing graph foundation models, on the other hand, primarily focus on text-attributed graphs (TAGs) and are not designed to handle the complexities of MMGs. To address these limitations, we propose UniGraph2, a novel cross-domain graph foundation model that enables general representation learning on MMGs, providing a unified embedding space. UniGraph2 employs modality-specific encoders alongside a graph neural network (GNN) to learn a unified low-dimensional embedding space that captures both the multimodal information and the underlying graph structure. We propose a new cross-domain multi-graph pre-training algorithm at scale to ensure effective transfer learning across diverse graph domains and modalities. Additionally, we adopt a Mixture of Experts (MoE) component to align features from different domains and modalities, ensuring coherent and robust embeddings that unify the information across modalities. Extensive experiments on a variety of multimodal graph tasks demonstrate that UniGraph2 significantly outperforms state-of-the-art models on tasks such as representation learning, transfer learning, and multimodal generation, offering a scalable and flexible solution for learning on MMGs.
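To make the role of the MoE alignment component more concrete, here is a minimal sketch of routing a single node feature vector through a small set of experts. It assumes softmax gating over the top-`k` experts and plain linear experts; the names `moe_align`, `gate`, and `experts` are hypothetical, and the abstract does not state which gating scheme or expert architecture UniGraph2 actually uses.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per output dimension


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def moe_align(x: Vector, gate: Matrix, experts: List[Matrix],
              top_k: int = 2) -> Vector:
    """Route one feature vector through its top_k experts (assumed scheme)."""
    logits = matvec(gate, x)  # one gating logit per expert
    chosen = sorted(range(len(experts)), key=lambda i: -logits[i])[:top_k]
    weights = softmax([logits[i] for i in chosen])  # renormalize over chosen
    out = [0.0] * len(experts[0])
    for w, i in zip(weights, chosen):
        y = matvec(experts[i], x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy setup: three linear experts over 2-dimensional features.
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
gate = [[1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
aligned = moe_align([1.0, 0.0], gate, [identity, double, identity], top_k=1)
```

With `top_k=1` the input is handled by the single expert with the largest gating logit; larger values of `top_k` blend several experts, which matches the idea of assigning a node to one or more experts.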

Paper Structure

This paper contains 22 sections, 12 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Overview of the UniGraph2 framework. In pre-training, 1) UniGraph2 uses frozen Modality-Specific Encoders to encode raw multimodal data (e.g., text, images) into vector node features. Then, a portion of these node features is randomly masked. 2) Considering the diversity of node features across different modalities and graph domains, a Mixture of Experts (MoE) network is used to align the different node features, allowing the model to assign each node to one or more experts based on its domain and modality. 3) The aligned node features are fed into a GNN for learning and projected into a unified embedding space. 4) The decoding involves two objectives: (a) each graph domain has its own decoder for reconstructing the node features; (b) a shared shortest-path distance decoder is used to reconstruct the graph structure.
  • Figure 2: UniGraph2 binds multimodal graphs from different graph domains to a unified embedding space, enabling diverse downstream tasks.
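Two ingredients of the pre-training pipeline described in Figure 1, random masking of node features and shortest-path distance targets for the structure decoder, can be sketched as follows. This is an illustration under assumptions: the helper names `mask_nodes` and `spd_from`, the 30% mask rate, and the use of unweighted hop counts computed by breadth-first search are not taken from the paper.

```python
import random
from collections import deque
from typing import Dict, List, Set

Graph = Dict[int, List[int]]  # adjacency list; every node appears as a key


def mask_nodes(nodes: List[int], mask_rate: float,
               rng: random.Random) -> Set[int]:
    """Pick a random subset of nodes whose features will be hidden."""
    count = min(len(nodes), max(1, round(len(nodes) * mask_rate)))
    return set(rng.sample(nodes, count))


def spd_from(graph: Graph, source: int) -> Dict[int, int]:
    """Breadth-first search: hop distance from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


# Path graph 0 - 1 - 2 - 3.
path_graph: Graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
masked = mask_nodes(list(path_graph), mask_rate=0.3, rng=random.Random(0))
spd_targets = spd_from(path_graph, source=0)
```

In a full training loop, the features of the nodes in `masked` would be replaced by a mask token before encoding, a feature decoder would be trained to recover them, and a structure decoder would be trained to predict entries of `spd_targets` from pairs of node embeddings.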

Theorems & Definitions (1)

  • Definition 1: Multimodal Graphs
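The full text of Definition 1 is not reproduced here, but the abstract's description (nodes associated with features from different modalities, edges capturing relationships between entities) suggests a container along the following lines. The class name `MultimodalGraph`, its fields, and the choice to store edges as undirected pairs are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Vector = List[float]


@dataclass
class MultimodalGraph:
    """Nodes carry per-modality feature vectors; edges link related nodes."""
    features: Dict[int, Dict[str, Vector]] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add_node(self, node_id: int, **modalities: Vector) -> None:
        # Each keyword names a modality, e.g. text=[...], image=[...].
        self.features[node_id] = dict(modalities)

    def add_edge(self, u: int, v: int) -> None:
        if u not in self.features or v not in self.features:
            raise KeyError("both endpoints must be added as nodes first")
        # Assumption: edges are undirected, so store each pair once.
        self.edges.add((min(u, v), max(u, v)))

    def neighbors(self, node_id: int) -> List[int]:
        return sorted(b if a == node_id else a
                      for a, b in self.edges if node_id in (a, b))

    def modalities(self) -> Set[str]:
        return {name for feats in self.features.values() for name in feats}


mmg = MultimodalGraph()
mmg.add_node(0, text=[0.1, 0.2], image=[0.9, 0.3, 0.5])
mmg.add_node(1, text=[0.4, 0.1])  # a node may lack some modalities
mmg.add_edge(1, 0)
```

Note that feature vectors from different modalities may have different lengths, which is one reason an alignment step is needed before the features reach a shared GNN.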