Multimodal Graph Representation Learning with Dynamic Information Pathways

Xiaobin Hong, Mingkai Lin, Xiaoli Wang, Chaoqun Wang, Wenzhong Li

TL;DR

By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space.

Abstract

Multimodal graphs, in which nodes carry heterogeneous features such as images and text, are increasingly common in real-world applications. Learning effectively on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning extend conventional graph neural networks and rely on static structures or dense attention, which limits their flexibility and their ability to learn expressive node embeddings. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We evaluate DiP on link prediction and node classification tasks and carry out comprehensive experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.
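
To make the linear-complexity claim concrete, the following is a minimal sketch of how routing through a small set of pseudo nodes can replace dense node-to-node attention within one modality. It is an illustration under stated assumptions, not the DiP update rule: the abstract does not give the exact equations, so the dot-product proximity, the softmax normalization, and the function name `pseudo_node_round` are all hypothetical choices made here. With N nodes, P pseudo nodes, and feature dimension d, one round costs O(N·P·d), which is linear in N when P is fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pseudo_node_round(nodes, pseudo):
    """One illustrative node -> pseudo node -> node routing round.

    Assumption: proximity is a softmax over dot products. The paper's
    actual proximity function is not specified in this section.
    """
    dim = len(nodes[0])
    # Proximity of every node to every pseudo node (N x P, rows sum to 1).
    prox = [softmax([dot(x, p) for p in pseudo]) for x in nodes]

    # Each pseudo node gathers a proximity-weighted mean of node features.
    new_pseudo = []
    for k in range(len(pseudo)):
        mass = sum(row[k] for row in prox)  # always > 0 after softmax
        new_pseudo.append([
            sum(prox[i][k] * nodes[i][j] for i in range(len(nodes))) / mass
            for j in range(dim)
        ])

    # Each node reads back from the pseudo nodes with the same weights.
    new_nodes = [
        [sum(row[k] * new_pseudo[k][j] for k in range(len(pseudo)))
         for j in range(dim)]
        for row in prox
    ]
    return new_nodes, new_pseudo, prox
```

Calling `pseudo_node_round(nodes, pseudo)` on lists of feature vectors returns updated node features, updated pseudo-node features, and the proximity matrix. No node ever attends directly to another node, so the quadratic N×N term of dense attention never appears.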

Paper Structure (21 sections, 8 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: A multimodal ego-graph example from a recommendation system, where each node is attributed with multimodal raw data (i.e., image and text) and each link indicates a complex relation between nodes.
  • Figure 2: Overview of the DiP framework. DiP first encodes the raw images and texts of the multimodal graph nodes with frozen modality encoders. The key component of DiP is the recursive $L$-step multimodal message passing mechanism, which consists of the Intra-Modal Diffusion Pathway and Inter-Modal Aggregation Pathway modules and outputs expressive node representations that incorporate both intra- and inter-modal message passing. Finally, the learned multimodal embeddings are fed to the task heads for link prediction and node classification training.
  • Figure 3: (a) DiP with adaptive pathways maintains performance as model depth increases. (b) Message-passing pathways: the proximities between sampled graph nodes and pseudo nodes across the two modalities. Some pseudo nodes show link activation (bright rows), which may reflect a clustering pattern of nodes from different classes.
  • Figure 4: Ablation on the number of visual ($n_{p_v}$) and textual ($n_{p_t}$) pseudo nodes.
  • Figure 5: t-SNE plots of node embeddings from the Ele-Fashion dataset.
  • ...and 1 more figure