GT2Vec: Large Language Models as Multi-Modal Encoders for Text and Graph-Structured Data

Jiacheng Lin, Kun Qian, Haoyu Han, Nurendra Choudhary, Tianxin Wei, Zhongruo Wang, Sahika Genc, Edward W Huang, Sheng Wang, Karthik Subbian, Danai Koutra, Jimeng Sun

TL;DR

GT2Vec addresses the challenge of integrating graph-structured data with text by using LLMs as joint encoders. It projects graph embeddings into the same space as text via a two-layer MLP adapter and employs a contrastive learning objective to align graph-text representations, enabling robust joint embeddings $\phi(x, \mathcal{G})$. The framework demonstrates strong gains across KG-contextualized QA, graph-text pair classification, and retrieval on six datasets, with ablations confirming the importance of graph context and the alignment mechanism. By leveraging LLMs for multimodal encoding, GT2Vec yields richer representations that improve reasoning over graphs and text, suggesting broad applicability to knowledge-infused NLP tasks and potential extensions to additional modalities.

Abstract

Graph-structured data offers rich contextual information that can enhance language models by providing structured relationships and hierarchies, leading to more expressive embeddings for applications such as retrieval, question answering, and classification. However, existing methods for integrating graph and text embeddings, often based on Multi-layer Perceptrons (MLPs) or shallow transformers, are limited in their ability to fully exploit the heterogeneous nature of these modalities. To overcome this, we propose GT2Vec, a simple yet effective framework that leverages Large Language Models (LLMs) to jointly encode text and graph data. Specifically, GT2Vec employs an MLP adapter to project graph embeddings into the same space as text embeddings, allowing the LLM to process both modalities jointly. Unlike prior work, we also introduce contrastive learning to align the graph and text spaces more effectively, thereby improving the quality of the learned joint embeddings. Empirical results across six datasets spanning three tasks (knowledge graph-contextualized question answering, graph-text pair classification, and retrieval) demonstrate that GT2Vec consistently outperforms existing baselines. Ablation studies further validate the effectiveness of our method.
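
To make the adapter step concrete, here is a minimal sketch of a two-layer MLP that maps a graph embedding into the dimensionality of an LLM's token embeddings so that both can be placed in one input sequence. Everything beyond "two-layer MLP adapter" is an assumption made for illustration: the toy dimensions, the ReLU activation, the random initialization, the helper names, and the choice to prepend the projected graph vector to the text token embeddings. It is not the authors' implementation.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Create (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def adapter(graph_embedding, layer1, layer2):
    """Two-layer MLP: graph space -> hidden -> LLM token-embedding space."""
    return linear(relu(linear(graph_embedding, layer1)), layer2)

rng = random.Random(0)
GRAPH_DIM, HIDDEN_DIM, LLM_DIM = 4, 8, 6  # toy sizes (assumed)

layer1 = make_linear(GRAPH_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Stand-in for the pooled output of a graph encoder (hypothetical values).
graph_embedding = [0.5, -1.0, 0.25, 2.0]
graph_token = adapter(graph_embedding, layer1, layer2)

# Stand-in for three text token embeddings from the LLM's embedding table.
text_tokens = [[rng.uniform(-1.0, 1.0) for _ in range(LLM_DIM)]
               for _ in range(3)]

# Assumed layout: the projected graph vector is prepended as one extra token.
llm_input = [graph_token] + text_tokens
```

The point of the projection is only that `graph_token` ends up with the same width as each text token embedding, so the LLM can consume the combined sequence without architectural changes.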

Paper Structure

This paper contains 44 sections, 8 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Overview of the GT2Vec framework. Unlike the common use of LLMs for generation tasks, we leverage LLMs to obtain joint embeddings of both text and graph data. We encode the input graph with a GNN, which provides the graph embeddings. The graph embeddings are then transformed into the word embedding space of the large language model. These embeddings are then fed into the large language model, and its outputs are used for various downstream tasks.
  • Figure 2: Overview of graph-text alignment through contrastive learning.
  • Figure 3: (a) The effect of LLM backbone choice on accuracy for the CommonsenseQA dataset. The figure shows three series: E5, LLaMA-3, and LLaMA-2, along with a single Mistral-7B model. (b) The effect of graph encoder depth (number of GNN layers) on test accuracy for CommonsenseQA and OpenBookQA datasets. The shaded areas represent the standard deviation, indicating the variance in performance across different trials. (c) Graph-text embedding distance (red dashed) and dev accuracy (purple solid) on CommonsenseQA across training epochs.
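
The graph-text alignment named in Figure 2 can be illustrated with a generic contrastive objective. The sketch below uses a symmetric InfoNCE-style loss over a batch of paired graph and text embeddings, where matching pairs share an index and all other items in the batch act as negatives. The exact loss used by GT2Vec is not given in this section, so the symmetric form, the cosine similarity, the temperature value, and all function names are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchor i should match positive i."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def symmetric_loss(graph_embs, text_embs, temperature=0.1):
    """Average of graph-to-text and text-to-graph InfoNCE terms."""
    return 0.5 * (info_nce(graph_embs, text_embs, temperature)
                  + info_nce(text_embs, graph_embs, temperature))

# Toy batch (hypothetical values): each graph embedding points roughly
# in the same direction as its paired text embedding.
graph_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [text_embs[1], text_embs[0]]  # deliberately mismatched pairs

loss_aligned = symmetric_loss(graph_embs, text_embs)
loss_swapped = symmetric_loss(graph_embs, swapped_text)
```

With these toy inputs the correctly paired batch yields a much smaller loss than the swapped one, which is the behavior an alignment objective of this kind rewards during training.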