Table of Contents
Fetching ...

Microstructures and Accuracy of Graph Recall by Large Language Models

Yanbang Wang, Hejie Cui, Jon Kleinberg

TL;DR

This work performs the first systematical study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall and finds that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from.

Abstract

Graphs data is crucial for many applications, and much of it exists in the relations described in textual format. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall has been studied by cognitive scientists for decades, and has been found to often exhibit certain structural patterns of bias that align with human handling of social relationships. To date, however, we know little about how LLMs behave in analogous graph recall tasks: do their recalled graphs also exhibit certain biased patterns, and if so, how do they compare with humans and affect other graph reasoning tasks? In this work, we perform the first systematical study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall. We find that LLMs not only underperform often in graph recall, but also tend to favor more triangles and alternating 2-paths. Moreover, we find that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from -- by yielding the best recall accuracy when the graph is narrated in a language style consistent with its original domain.

Microstructures and Accuracy of Graph Recall by Large Language Models

TL;DR

This work performs the first systematical study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall and finds that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from.

Abstract

Graphs data is crucial for many applications, and much of it exists in the relations described in textual format. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall has been studied by cognitive scientists for decades, and has been found to often exhibit certain structural patterns of bias that align with human handling of social relationships. To date, however, we know little about how LLMs behave in analogous graph recall tasks: do their recalled graphs also exhibit certain biased patterns, and if so, how do they compare with humans and affect other graph reasoning tasks? In this work, we perform the first systematical study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall. We find that LLMs not only underperform often in graph recall, but also tend to favor more triangles and alternating 2-paths. Moreover, we find that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from -- by yielding the best recall accuracy when the graph is narrated in a language style consistent with its original domain.
Paper Structure (28 sections, 2 equations, 9 figures, 7 tables)

This paper contains 28 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Graph recall is a simple task but also a crucial pivot for other graph reasoning tasks.
  • Figure 2: Experimental protocols for analyzing microstructures and accuracy of LLM's graph recall. See Sec.\ref{['subsec:protocols']} for detailed explanations.
  • Figure 3: Different factors that influence LLM's graph recall. (a) - (c): narrative styles. The heatmaps show that more advanced LLMs like GPT-4 yield best recall accuracy when the graph is narrated in a language style consistent with its original domain. (d) - (f): memory clearance. Gemini-Pro appears more sensitive to small noise in context, while GPT's are more robust.
  • Figure 4: Correlation between GPT-3.5's performance at graph recall ($y$) and link prediction ($x$).
  • Figure 5: Graph samples of the Facebook dataset.
  • ...and 4 more figures