ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings

William Brannon, Wonjune Kang, Suyash Fulay, Hang Jiang, Brandon Roy, Deb Roy, Jad Kabbara

TL;DR

ConGraT is a self-supervised joint pretraining framework for text-attributed graphs (TAGs). It trains a text encoder based on a pretrained language model (PLM) and a GNN-based node encoder to align in a shared embedding space via a CLIP-inspired batch-wise contrastive objective. A graph-informed similarity term (controlled by α) lets the objective account for multi-step neighbor relationships, and the method is inductive, encoder-flexible, and task-agnostic. Empirically, ConGraT improves node classification, link prediction, and language modeling across citation, link, and social TAGs; the graph-similarity term yields additional gains, and the joint embeddings enable more text-grounded community detection. The work also reports improved cross-modal geometry and retrieval performance, and it points to potential applications in social networks and knowledge graphs while acknowledging ethical and scalability considerations.

Abstract

Learning on text-attributed graphs (TAGs), in which nodes are associated with one or more texts, has been the subject of much recent work. However, most approaches tend to make strong assumptions about the downstream task of interest, are reliant on hand-labeled data, or fail to equally balance the importance of both text and graph representations. In this work, we propose Contrastive Graph-Text pretraining (ConGraT), a general, self-supervised approach for jointly learning separate representations of texts and nodes in a TAG. Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP. We further propose an extension to the CLIP objective that leverages graph structure to incorporate information about inter-node similarity. Extensive experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling. Finally, we present an application of our method to community detection in social graphs, which enables finding more textually grounded communities, rather than purely graph-based ones. Code and certain datasets are available at https://github.com/wwbrannon/congrat.
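
As an illustration of the objective described above (and in the Figure 2 caption), the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the `temperature` parameter, the use of a single similarity matrix `graph_sim` for both rows and columns, and the choice to normalize each similarity row by its sum are all assumptions made for this example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Soft-target cross entropy: -sum_j target[j] * log_softmax(logits)[j]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))

def mixed_targets(sim_row, i, alpha):
    """Mix the one-hot target at index i with a distribution proportional
    to a graph-based similarity row (assumed non-negative)."""
    total = sum(sim_row)
    targets = []
    for j, s in enumerate(sim_row):
        one_hot = 1.0 if j == i else 0.0
        # Assumption: if the row carries no similarity mass, fall back to one-hot.
        sim_p = s / total if total > 0 else one_hot
        targets.append((1.0 - alpha) * one_hot + alpha * sim_p)
    return targets

def contrastive_loss(text_emb, node_emb, graph_sim, alpha=0.0, temperature=1.0):
    """Average of row-wise and column-wise cross entropies over the
    text-by-node cosine-similarity matrix of one minibatch."""
    n = len(text_emb)
    texts = [l2_normalize(v) for v in text_emb]
    nodes = [l2_normalize(v) for v in node_emb]
    logits = [
        [sum(a * b for a, b in zip(texts[i], nodes[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    total = 0.0
    for i in range(n):
        target = mixed_targets(graph_sim[i], i, alpha)
        row = logits[i]                           # text i against all nodes
        col = [logits[j][i] for j in range(n)]    # node i against all texts
        total += cross_entropy(row, target) + cross_entropy(col, target)
    return total / (2 * n)

# Toy minibatch: two (text, origin node) pairs in a 2-D space.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
node_emb = [[1.0, 0.0], [0.0, 1.0]]
graph_sim = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical node-similarity values

aligned = contrastive_loss(text_emb, node_emb, graph_sim, alpha=0.0)
swapped = contrastive_loss(text_emb, node_emb[::-1], graph_sim, alpha=0.0)
```

With `alpha=0.0` the targets reduce to the plain CLIP-style diagonal, so matched pairs (`aligned`) incur a lower loss than mismatched ones (`swapped`); raising `alpha` shifts target mass toward nodes that the graph-based measure rates as similar.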

Paper Structure

This paper contains 51 sections, 3 equations, 5 figures, and 13 tables.

Figures (5)

  • Figure 1: Embeddings of graph nodes in red (e.g., Twitter users), and their associated texts in blue (e.g., tweets). They are placed into a common embedding space, with nodes near their associated texts. Node-text pairs are labeled N1 to N5. Note that not every node must have an associated text (here, N5 does not).
  • Figure 2: The overall architecture of our model. Given a minibatch of (text, origin node) pairs, node and text embeddings are generated by their respective encoders, then used to compute pairwise cosine similarities. The final loss is the average of cross entropies along each row and column of the similarity matrix, with each row $i$'s target probabilities (labeled $\mathbb{D}_T^{(i)}$ and $\mathbb{D}_G^{(i)}$) a mixture of the true targets (on the diagonal) and a (row- or column-specific) distribution proportional to a graph-based similarity measure.
  • Figure 3: 2D UMAP visualizations of GAT and ConGraT ($\alpha = 0.0$) embeddings on the Twitter data subset with U.S. political party labels (blue = Democrat, orange = Republican).
  • Figure 4: Test-set AUCs for predictions of community labels from text embeddings on the Twitter dataset. "Louvain" denotes Louvain communities detected in the follow graph, "Baseline" the GAT baseline model, and "ConGraT" our model with $\alpha = 0.0$.
  • Figure 5: Top-k accuracy on selection of the node which produced a text, for various values of $k$, as discussed in the appendix subsection on embedding geometry and retrieval. "Baseline" indicates the use of separately pretrained embeddings, and other results are for models with various combinations of edge-direction use and graph-similarity information.
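
The top-k retrieval metric in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming that nodes are ranked by cosine similarity to each text embedding; the function names and toy vectors are hypothetical and do not come from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_accuracy(text_emb, node_emb, origin, k):
    """Fraction of texts whose true origin node is among the k nodes
    most similar to the text embedding."""
    hits = 0
    for text_vec, true_node in zip(text_emb, origin):
        scores = [cosine(text_vec, node_vec) for node_vec in node_emb]
        ranked = sorted(range(len(node_emb)), key=lambda j: scores[j], reverse=True)
        if true_node in ranked[:k]:
            hits += 1
    return hits / len(text_emb)

# Toy data: three nodes and three texts, where text i was produced by node origin[i].
node_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
origin = [0, 1, 2]

acc_at_1 = top_k_accuracy(text_emb, node_emb, origin, k=1)
acc_at_2 = top_k_accuracy(text_emb, node_emb, origin, k=2)
```

In this toy setup the third text lies closer to node 0 than to its true origin, node 2, so it counts as a miss at k=1 and as a hit at k=2.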