Language Model-Enhanced Message Passing for Heterophilic Graph Learning

Wenjun Wang, Dawei Cheng

TL;DR

LEMP4HG tackles heterophilic graph learning by integrating LM-generated connection analyses with SLM-encoded node texts to produce semantically rich messages. It introduces a gating-based fusion of LM messages and node embeddings, and a Modulated Variation of Reliable Distance (MVRD) to drive selective, budget-bounded LM querying. An active-learning component selects the most informative edges to enhance, reducing cost and mitigating interference on homophilic regions. Across 16 real-world text-attributed graphs, LEMP4HG demonstrates robust gains on heterophilic cases and stable performance on homophilic ones, providing practical budget guidelines and insights into LM-assisted graph propagation.

Abstract

Traditional graph neural networks (GNNs), which rely on homophily-driven message passing, struggle with heterophilic graphs, where connected nodes exhibit dissimilar features and different labels. While existing methods address heterophily through graph structure refinement or adaptation of neighbor aggregation functions, they often overlook the semantic potential of node text, rely on suboptimal message representations for propagation, and compromise performance on homophilic graphs. To address these limitations, we propose a novel language model (LM)-enhanced message passing approach for heterophilic graph learning (LEMP4HG). Specifically, in the context of text-attributed graphs, we provide paired node texts to the LM to generate an analysis of their connection, which is encoded and then fused with the paired node textual embeddings through a gating mechanism. The synthesized messages are semantically enriched and adaptively balanced with both nodes' information, which mitigates contradictory signals during neighbor aggregation in heterophilic regions. Furthermore, we introduce an active learning strategy guided by our heuristic MVRD (Modulated Variation of Reliable Distance), which selectively enhances the node pairs that suffer most from message passing, reducing the cost of analysis generation and the side effects on homophilic regions. Extensive experiments validate that our approach excels on heterophilic graphs and performs robustly on homophilic ones, with a graph convolutional network (GCN) backbone and a practical budget.
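
The gating step described in the abstract can be pictured with a small sketch. The abstract does not give the fusion formula, so the form below is an assumption made only for illustration: a sigmoid gate computed from the concatenated embeddings weighs the source node against the target node, and the result is added to the encoded connection analysis. The names `gated_message`, `W`, and `b` are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function; inputs in this sketch are small, so overflow is not a concern."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_message(h_i, h_j, h_ij, W, b):
    """Fuse two node embeddings with an encoded connection analysis.

    Hypothetical form (an assumption, not taken from the paper):
        gate = sigmoid(W @ [h_i; h_j; h_ij] + b)
        m_ij = h_ij + gate * h_i + (1 - gate) * h_j
    """
    d = len(h_i)
    z = list(h_i) + list(h_j) + list(h_ij)  # concatenated input, length 3d
    gate = [
        sigmoid(sum(W[k][t] * z[t] for t in range(3 * d)) + b[k])
        for k in range(d)
    ]
    m_ij = [h_ij[k] + gate[k] * h_i[k] + (1.0 - gate[k]) * h_j[k] for k in range(d)]
    return m_ij, gate


# Toy usage with random 4-dimensional embeddings.
random.seed(0)
d = 4
h_i = [random.uniform(-1, 1) for _ in range(d)]
h_j = [random.uniform(-1, 1) for _ in range(d)]
h_ij = [random.uniform(-1, 1) for _ in range(d)]
W = [[random.uniform(-0.1, 0.1) for _ in range(3 * d)] for _ in range(d)]
b = [0.0] * d

m_ij, gate = gated_message(h_i, h_j, h_ij, W, b)
print("gate:", [round(g, 3) for g in gate])
print("message:", [round(m, 3) for m in m_ij])
```

With all-zero weights and bias, every gate entry is 0.5 and the message reduces to the analysis embedding plus the average of the two node embeddings. In a trained model, the gate would presumably shift toward whichever endpoint is more informative for the pair.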

Paper Structure

This paper contains 70 sections, 1 theorem, 18 equations, 7 figures, 9 tables.

Key Result

Theorem 1

Let $\gamma > 0$ and $\sigma: \mathbb{R} \to \mathbb{R}^+$ be strictly increasing. Then the reliable difference $RD_{ij}$ is strictly increasing w.r.t. $d_{ij}$, and strictly decreasing w.r.t. $d_i$ and $d_j$.
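
The monotonicity claimed by Theorem 1 can be checked numerically on a stand-in. This excerpt does not define $RD_{ij}$, so the function below is not the paper's definition. It is a hypothetical form chosen only because it meets the stated conditions: $\gamma > 0$, and $\sigma$ strictly increasing and positive (here $\sigma = \exp$). The inputs `d_ij`, `d_i`, and `d_j` are treated as plain positive numbers.

```python
import math


def reliable_difference_standin(d_ij, d_i, d_j, gamma=1.0, sigma=math.exp):
    """Stand-in for RD_ij, NOT the paper's definition.

    Divides the pair term d_ij by sigma(gamma * d_i) * sigma(gamma * d_j).
    Assumes d_ij > 0 so that the dependence on d_i and d_j is strict.
    """
    return d_ij / (sigma(gamma * d_i) * sigma(gamma * d_j))


base = reliable_difference_standin(1.0, 0.5, 0.5)

# Raising d_ij should raise the value; raising d_i or d_j should lower it.
print(reliable_difference_standin(1.5, 0.5, 0.5) > base)
print(reliable_difference_standin(1.0, 0.8, 0.5) < base)
print(reliable_difference_standin(1.0, 0.5, 0.8) < base)
```

Any other positive, strictly increasing `sigma` could be passed in to repeat the check.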

Figures (7)

  • Figure 1: Overview of LEMP4HG. (a) Illustration of the embedding shift after message passing; (b) heuristic definition measuring how much a node pair suffers from message passing. The pipeline has three parts: (c) initially, we fine-tune the SLM for textual encoding with an MLP as the classifier; (d) every $\mathcal{I}$ epochs, we select edges by MVRD to query the LM for connection analyses; (e) each epoch, we synthesize all encoded analyses and paired node texts to form enhanced messages for GNN training.
  • Figure 2: Rank distribution on 5 dataset categories. The lower the box, the more robust the model.
  • Figure 3: Scalability study on Cora, Pubmed, arxiv23 and Children: accuracy vs. budget.
  • Figure 4: (left) Embedding space before and after message passing. (right) Logits distribution.
  • Figure 5: (left) Gate vector that balances the contribution of source and target node embeddings. (right) Similarity matrix between the synthesized message $\boldsymbol{m}_{ij}$ and the preliminary message $\boldsymbol{h}_{ij}$, the source node embedding $\boldsymbol{h}_i$, and the target node embedding $\boldsymbol{h}_j$. The vertical line separates all node pairs into $y_i=y_j$ and $y_i\neq y_j$.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Proof F.1