Which Modality should I use -- Text, Motif, or Image? : Understanding Graphs with Large Language Models

Debarati Das, Ishaan Gupta, Jaideep Srivastava, Dongyeop Kang

TL;DR

This work tackles the challenge of enabling large language models to reason about graph-structured data under strict context-window constraints. It proposes multimodal graph encodings (text, motif, image) and introduces GraphTMI, a benchmark with prompts and modality-specific data to study node classification. Key findings show that image-based encoding with vision-language models like GPT-4V achieves favorable token efficiency and information preservation, often outperforming prior GNN encoders, while task difficulty guides modality choice (image for medium-hard; motif for hard). The results highlight both the potential of LLMs in graph understanding and the need for combining modalities or integrating with GNNs for strong performance on real-world graphs.

Abstract

Our research integrates graph data with Large Language Models (LLMs), which, despite their advancements in various fields using large text corpora, face limitations in encoding entire graphs due to context size constraints. This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, coupled with prompts to approximate a graph's global connectivity, thereby enhancing LLMs' efficiency in processing complex graph structures. The study also presents GraphTMI, a novel benchmark for evaluating LLMs in graph structure analysis, focusing on homophily, motif presence, and graph difficulty. Key findings indicate that the image modality, especially with vision-language models like GPT-4V, is superior to text in balancing token limits and preserving essential information and outperforms prior graph neural net (GNN) encoders. Furthermore, the research assesses how various factors affect the performance of each encoding modality and outlines the existing challenges and potential future developments for LLMs in graph understanding and reasoning tasks. All data will be publicly available upon acceptance.
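To make the context-size constraint concrete, here is a minimal sketch of what a text-modality encoding might look like. The prompt layout and the function name `encode_graph_as_text` are assumptions made for illustration; the benchmark's actual prompt format is not given in this summary. The sketch emits one line per edge, so the prompt grows linearly with the number of edges, which is why large graphs run into the token limit.

```python
def encode_graph_as_text(edges, labels, target):
    """Render an undirected graph as an edge-list prompt.

    Hypothetical format for illustration only, not the paper's prompt.
    """
    lines = ["Edge list:"]
    for u, v in edges:
        lines.append(f"{u} -- {v}")  # one line per edge
    lines.append("Node labels:")
    nodes = sorted({n for edge in edges for n in edge})
    for n in nodes:
        # Hide the label of the node we want the model to classify.
        shown = "?" if n == target else labels[n]
        lines.append(f"{n}: {shown}")
    lines.append(f"Predict the label of node {target}.")
    return "\n".join(lines)


# Toy graph: a triangle (0, 1, 2) with one pendant node (3).
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
labels = {0: "A", 1: "A", 2: "A", 3: "B"}
prompt = encode_graph_as_text(edges, labels, target=2)
print(prompt)
```

An image or motif encoding would replace the per-edge lines with a rendered picture or a short motif summary, which is where the token savings discussed in the abstract would come from.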

Paper Structure

This paper contains 23 sections, 5 equations, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Input modality encoding for graphs impacts node classification, with text modality offering detailed information from a local point of view but violating the input context limitations for LLMs due to verbosity. Motif modality provides local and global context, while image modality gives a comprehensive global view, efficiently processed by GPT-4V, which integrates capabilities from both vision and text.
  • Figure 2: Node Classification on a Graph using different input modality encodings like Text, Motif, and Image.
  • Figure 3: Changes to the image representation were applied sequentially to a graph; human readability and understanding of the graph structure increase distinctly from (a) to (f).
  • Figure 4: Classifying graph task difficulty based on the criteria of Homophily and Number of Motifs yields a dataset of EASY, MEDIUM, and HARD graph problems and their associated modality encodings and classifications. This benchmark is called the GraphTMI dataset.
  • Figure 5: The text and image modalities have similar accuracy rates, the motif modality has the highest mismatch rate, and the image modality has the lowest denial rate and token limit fraction. Mean metric values (y-axis) are plotted against modality type (x-axis).
  • ...and 11 more figures
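
The two criteria named in the Figure 4 caption, homophily and number of motifs, can be computed with a few lines of code. The sketch below is an assumption-laden illustration: it uses edge homophily (the fraction of edges whose endpoints share a label) and counts triangles as one example motif. The function names are hypothetical, and the thresholds that map these statistics to EASY, MEDIUM, and HARD are not stated in this summary, so no classification rule is shown.

```python
from itertools import combinations


def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def count_triangles(edges):
    """Count triangles in an undirected graph (one example of a motif)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    triangles = 0
    # Brute force over node triples; fine for small illustrative graphs.
    for a, b, c in combinations(sorted(adjacency), 3):
        if b in adjacency[a] and c in adjacency[a] and c in adjacency[b]:
            triangles += 1
    return triangles


# Toy graph: a triangle (0, 1, 2) with one pendant node (3).
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
labels = {0: "A", 1: "A", 2: "A", 3: "B"}

print(edge_homophily(edges, labels))  # 3 of 4 edges join same-label nodes
print(count_triangles(edges))         # the single triangle 0-1-2
```

On real graphs a library routine would be preferable to the brute-force triple loop, which scales cubically with the number of nodes.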