UniGLM: Training One Unified Language Model for Text-Attributed Graph Embedding

Yi Fang, Dongzhe Fan, Sirui Ding, Ninghao Liu, Qiaoyu Tan

TL;DR

UniGLM tackles the challenge of learning generalizable embeddings for text-attributed graphs by pre-training a single language-model-based encoder across multiple TAGs from diverse domains. It introduces an adaptive, learnable positive sampling mechanism and a lazy contrastive module that together enable effective domain-aware contrastive learning while maintaining training efficiency. Empirical results across nine TAG benchmarks show strong in-domain and cross-domain transfer, with UniGLM consistently outperforming state-of-the-art baselines in node classification and link prediction. The approach yields a scalable, cross-domain graph embedding foundation model that leverages a shared textual space to integrate structure across heterogeneous TAGs.

Abstract

Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored to individual TAGs and cannot generalize across various graph scenarios. Given the shared textual space, leveraging multiple TAGs for joint fine-tuning, aligning text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module devised to accelerate training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM's efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in-domain and out-of-domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.
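
As a rough illustration of the two ideas named in the abstract, the sketch below pairs a generic InfoNCE-style contrastive loss with a cache that avoids re-encoding nodes whose embeddings are already stored. It is not the authors' implementation. The names `contrastive_loss`, `LazyEmbeddingBank`, and `toy_encode`, the temperature value, and the caching policy are all assumptions made for illustration. The cache is only one plausible reading of "minimizing repetitive encoding calculations".

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(anchor, positives, negatives, temperature=0.5):
    """Generic InfoNCE-style loss for one anchor with one or more positives.

    Assumption: cosine similarity and a fixed temperature; the paper's exact
    objective may differ.
    """
    a = normalize(anchor)
    pos = [math.exp(dot(a, normalize(p)) / temperature) for p in positives]
    neg = [math.exp(dot(a, normalize(n)) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    # Average the negative log-likelihood over all positives.
    return -sum(math.log(p / denom) for p in pos) / len(pos)


class LazyEmbeddingBank:
    """Hypothetical cache: a node is re-encoded only when it is an anchor;
    when it appears as a positive or negative, its stored embedding is reused."""

    def __init__(self, encode):
        self.encode = encode      # any callable mapping text -> list of floats
        self.cache = {}
        self.encode_calls = 0

    def anchor(self, node_id, text):
        # Fresh encoding; also refreshes the cached copy for later reuse.
        self.encode_calls += 1
        self.cache[node_id] = self.encode(text)
        return self.cache[node_id]

    def lookup(self, node_id, text):
        # Reuse the cached embedding when available; encode only on a miss.
        if node_id not in self.cache:
            return self.anchor(node_id, text)
        return self.cache[node_id]


def toy_encode(text):
    """Stand-in for a language-model encoder: vowel counts (plus one)."""
    return [text.count(c) + 1.0 for c in "aeiou"]


bank = LazyEmbeddingBank(toy_encode)
texts = {0: "graph neural network", 1: "graph embedding", 2: "medieval history"}
anchor_emb = bank.anchor(0, texts[0])
positive_emb = bank.lookup(1, texts[1])   # cache miss: encoded once
negative_emb = bank.lookup(2, texts[2])   # cache miss: encoded once
positive_again = bank.lookup(1, texts[1])  # cache hit: no new encoding
loss = contrastive_loss(anchor_emb, [positive_emb], [negative_emb])
```

In this toy run the encoder is called three times for four embedding requests. With a real language model as the encoder, savings of this kind are what a lazy scheme would aim for, at the cost of cached embeddings lagging slightly behind the current model weights.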

Paper Structure (18 sections, 7 equations, 6 figures, 6 tables)

This paper contains 18 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The proposed UniGLM framework. The UniGLM framework trains a unified graph encoder across multiple TAGs using domain-aware contrastive learning, instead of learning separate language models for each TAG. To ensure effective and efficient textual-to-structure alignment, we introduce an adaptive and learnable positive sample selection scheme and a lazy updating strategy. UniGLM serves as a foundational embedding model for TAGs, consistently delivering strong performance across various downstream tasks and backbones.
  • Figure 2: Link prediction results, measured by AUC.
  • Figure 3: Ablation study: the impact of different sampling strategies. Results are averaged over the MLP, GCN, and SAGE backbones.
  • Figure 4: The impact of the learnable positive generation scheme on UniGLM. Results are averaged over three backbones: MLP, GCN, and SAGE.
  • Figure 5: Visualization of the generated positive sample and the corresponding positive candidates on the History dataset.
  • ...and 1 more figure