LLMs Prompted for Graphs: Hallucinations and Generative Capabilities

Gurvan Richardeau, Samy Chali, Erwan Le Merrer, Camilla Penzo, Gilles Tredan

TL;DR

This work probes LLMs in two graph-centered tasks to measure hallucinations and emergent generative ability. It first tests the memorization of known graphs by prompting for edge lists and compares outputs to ground truth using topology, embeddings, and a graph-edit-distance framework (Graph Atlas Distance). It then examines generation of Erdős–Rényi random graphs, applying a χ²-based degree-distribution test and syntactic correctness metrics to quantify ER-likeness across 25 LLMs under different prompting strategies. Key findings show pervasive graph hallucinations in recitation, with partial alignment to a hallucination leaderboard for a subset of models, and a notable, though variable, ability to produce structured ER graphs, suggesting an emergent capability that depends on prompting, model family, and parameters. Together, these results motivate graph-centric benchmarks as a resource bridging network science and ML to more precisely evaluate and compare LLMs beyond traditional text-based tasks.

Abstract

Large Language Models (LLMs) are nowadays prompted for a wide variety of tasks. In this article, we investigate their ability to recite and generate graphs. We first study the ability of LLMs to regurgitate well-known graphs from the literature (e.g. Karate club or the graph atlas). Secondly, we question the generative capabilities of LLMs by asking for Erdős–Rényi random graphs. As opposed to the possibility that they could memorize some Erdős–Rényi graphs included in their scraped training set, this second investigation aims at studying a possible emergent property of LLMs. For both tasks, we propose a metric to assess their errors through the lens of hallucination (i.e. incorrect information returned as facts). We most notably find that the amplitude of graph hallucinations can characterize the superiority of some LLMs. Indeed, for the recitation task, we observe that graph hallucinations correlate with the Hallucination Leaderboard, a hallucination ranking that leverages 10,000 times more prompts. For the generation task, we find surprisingly good and reproducible results for most LLMs. We believe this constitutes a starting point for more in-depth studies of this emergent capability and a challenging benchmark for their improvement. Altogether, these two aspects of LLM capabilities bridge a gap between the network science and machine learning communities.
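
To make the generation task more concrete, here is a minimal sketch of a χ²-style check on a degree distribution, in the spirit of the test mentioned in the TL;DR. It is not the authors' implementation: the function names (`degree_chi2`, `sample_er`), the pooling threshold `min_expected`, and the convention that nodes are labelled 0 to n-1 are all assumptions made for illustration. In an Erdős–Rényi graph G(n, p), each node degree follows a Binomial(n-1, p) distribution, so the observed degree counts can be compared to the expected ones. Degrees within one graph are not independent, so the statistic is only an approximate indicator.

```python
import math
import random
from collections import Counter


def binomial_pmf(k, trials, p):
    """Probability of k successes among `trials` Bernoulli(p) draws."""
    return math.comb(trials, k) * p**k * (1 - p) ** (trials - k)


def degree_chi2(edges, n, p, min_expected=5.0):
    """Return (chi-square statistic, degrees of freedom) for the degree
    distribution of `edges` against Binomial(n - 1, p).

    Assumption: nodes are labelled 0..n-1; duplicate edges and self-loops
    in the input are discarded before counting degrees.
    """
    unique = {frozenset((u, v)) for u, v in edges if u != v}
    degree = Counter()
    for edge in unique:
        for node in edge:
            degree[node] += 1
    observed = Counter(degree[node] for node in range(n))

    # Pool adjacent degree values until each bin has a large enough
    # expected count, a common rule of thumb for chi-square tests.
    bins = []
    obs_acc, exp_acc = 0, 0.0
    for k in range(n):
        obs_acc += observed[k]
        exp_acc += n * binomial_pmf(k, n - 1, p)
        if exp_acc >= min_expected:
            bins.append([obs_acc, exp_acc])
            obs_acc, exp_acc = 0, 0.0
    if bins:
        bins[-1][0] += obs_acc  # fold the leftover tail into the last bin
        bins[-1][1] += exp_acc
    else:
        bins.append([obs_acc, exp_acc])

    stat = sum((o - e) ** 2 / e for o, e in bins)
    return stat, len(bins) - 1


def sample_er(n, p, seed=0):
    """Reference G(n, p) sampler: keep each possible edge with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


n, p = 60, 0.1
ring = [(i, (i + 1) % n) for i in range(n)]  # every node has degree 2
er_stat, er_dof = degree_chi2(sample_er(n, p), n, p)
ring_stat, ring_dof = degree_chi2(ring, n, p)
print(f"ER sample: chi2={er_stat:.2f} (dof={er_dof})")
print(f"ring:      chi2={ring_stat:.2f} (dof={ring_dof})")
```

A graph returned by an LLM would be passed to `degree_chi2` in place of the ring. The statistic can then be compared to a χ² critical value for the reported degrees of freedom; a highly regular graph such as the ring yields a much larger statistic than a genuine G(n, p) sample.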

Paper Structure (31 sections, 4 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Prompting gpt4o for Zachary's karate club (KC) graph. (a) the output graph, (b) the intersection of the output graph with the KC ground-truth graph, (c) the edges added (hallucinated) w.r.t. the KC graph, and (d) the edges missing w.r.t. the KC graph.
  • Figure 2: Example output graphs with salient particularities, when prompted for the Zachary's karate club graph.
  • Figure 3: A t-SNE representation of the KC graph and of LLM outputs.
  • Figure 4: Test Success Rate $\gamma$ and Syntactically Correct Answer Rate $\sigma$ over LLM outputs on Erdős–Rényi graphs. Results are averaged over the 7 $(n,p)$ Erdős–Rényi parameter pairs. The experiment includes 25 LLMs and 200 graph queries per LLM-$(n,p)$ pair. Temperature is set to 1.0 and the CoT prompt is used.
  • Figure 5: Influence of the Erdős–Rényi parameters $(n,p)$ on the Test Success Rate $\gamma$, with prompt variation. The models $M$ are summarized by their median and interquartile range. The experiment includes 25 LLMs, 7 $(n,p)$ parameter pairs and 200 graph queries per LLM-$(n,p)$-prompt triplet. Temperature is 1.0.
  • ...and 2 more figures
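
The decomposition described in the Figure 1 caption (intersection, added edges, missing edges) can be sketched with plain set operations on undirected edge lists. This is an illustrative sketch only: the function names and the toy graphs below are hypothetical, the toy data is not the karate club graph, and the paper's own hallucination metric is not reproduced here.

```python
def normalize(edges):
    """Turn an edge list into a set of undirected edges.

    Each edge becomes a frozenset so that (u, v) and (v, u) are equal;
    self-loops and duplicate edges are dropped.
    """
    return {frozenset((u, v)) for u, v in edges if u != v}


def compare_edge_lists(output_edges, truth_edges):
    """Split an LLM output into correct, hallucinated, and missing edges."""
    out, truth = normalize(output_edges), normalize(truth_edges)
    return {
        "correct": out & truth,       # edges present in both graphs
        "hallucinated": out - truth,  # edges added by the model
        "missing": truth - out,       # ground-truth edges the model omitted
    }


# Toy ground truth (a path on 4 nodes) and a hypothetical model answer.
truth = [(1, 2), (2, 3), (3, 4)]
answer = [(2, 1), (3, 4), (4, 1), (4, 1)]

report = compare_edge_lists(answer, truth)
for name, edge_set in report.items():
    print(name, sorted(tuple(sorted(e)) for e in edge_set))
```

Here the answer recovers two of the three true edges, adds one edge that does not exist in the ground truth, and omits one.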