Can an LLM Induce a Graph? Investigating Memory Drift and Context Length

Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu, Mandar Sharma, Naren Ramakrishnan

TL;DR

This paper targets the gap between existing long-context benchmarks and real-world relational reasoning by introducing a graph reconstruction task that requires inducing latent structure from dispersed, noisy text. It formalizes a graph-based prompt framework with edge, subgraph, and clique discovery subtasks and introduces a memory drift metric to capture global degradation beyond token-level accuracy. Across five popular LLMs, the study finds that memory drift begins at roughly 2,000 tokens and worsens with connective density. Chain-of-thought (CoT) prompting offers little to no help and in some cases harms performance, and reasoning-specialized models do not reliably overcome these effects. The findings highlight significant limitations in current LLMs' ability to abstract structured knowledge from unstructured input and argue for architectural or memory-augmented approaches to enable robust long-range reasoning in practical applications.

Abstract

Recently proposed evaluation benchmarks aim to characterize the effective context length and the forgetting tendencies of large language models (LLMs). However, these benchmarks often rely on simplistic 'needle in a haystack' retrieval or continuation tasks that may not accurately reflect the performance of these models in information-dense scenarios. Thus, rather than simple next-token prediction, we argue for evaluating these models on more complex reasoning tasks that require them to induce structured relational knowledge from text, such as graphs from potentially noisy natural language content. While the input text can be viewed as generated from a graph, its structure is not made explicit, and connections must be induced from distributed textual cues that are separated by long contexts and interspersed with irrelevant information. Our findings reveal that LLMs begin to exhibit memory drift and contextual forgetting at much shorter effective lengths on this form of relational reasoning than existing benchmarks suggest. Based on these findings, we offer recommendations for the optimal use of popular LLMs for complex reasoning tasks. We further show that even models specialized for reasoning, such as OpenAI o1, remain vulnerable to early memory drift in these settings. These results point to significant limitations in the models' ability to abstract structured knowledge from unstructured input and highlight the need for architectural adaptations to improve long-range reasoning.
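
To make the task setup concrete, here is a minimal sketch of how a latent graph could be turned into a dispersed, noisy prompt and how a model's recovered edges could be scored. It is an illustration under stated assumptions, not the authors' pipeline: the function names `build_prompt` and `edge_scores`, the sentence template, and the filler text are all hypothetical. The paper's memory drift metric is not reproduced, since its definition is not given in this section; the sketch reports only standard precision, recall, and F1 over edges.

```python
import random


def build_prompt(edges, distractors, filler="has no known link to anyone.", seed=0):
    """Scatter one sentence per edge among distractor sentences.

    The sentence template and filler wording are assumptions made for
    illustration; they are not taken from the paper.
    """
    rng = random.Random(seed)
    sentences = [f"{a} is connected to {b}." for a, b in edges]
    sentences += [f"{name} {filler}" for name in distractors]
    rng.shuffle(sentences)  # disperse the relational cues through the text
    return " ".join(sentences)


def edge_scores(true_edges, predicted_edges):
    """Return (precision, recall, F1) over undirected edges."""
    truth = {frozenset(edge) for edge in true_edges}
    pred = {frozenset(edge) for edge in predicted_edges}
    hits = len(truth & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical latent graph with two edges and two disconnected distractor nodes.
latent_edges = [("Alice", "Bob"), ("Bob", "Carol")]
prompt = build_prompt(latent_edges, distractors=["Dave", "Erin"])

# Pretend the model recovered one correct edge (in reverse order) and one wrong edge.
model_edges = [("Bob", "Alice"), ("Alice", "Dave")]
precision, recall, f1 = edge_scores(latent_edges, model_edges)
```

Edges are compared as unordered pairs, so a model that reports `("Bob", "Alice")` is credited for the edge `("Alice", "Bob")`. Whether the paper treats edges as directed is not stated here, so that choice is also an assumption.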

Paper Structure

This paper contains 23 sections, 6 equations, 10 figures, 3 tables, 2 algorithms.

Figures (10)

  • Figure 1: Memory drift (lower = better) on the simplest relational task (one connection per sample). Despite minimal complexity, all models degrade beyond a certain context length, showing that even low relational load challenges long-context reasoning. See Table \ref{tab:model_comparison} for the TL;DR.
  • Figure 2: Overview of our task: given a long, noisy text (left), the model (center) reconstructs the underlying relational graph (right) by identifying connections between entities (edge, subgraph, clique). Disconnected nodes are distractors.
  • Figure 3: Graph-to-prompt pipeline: relational structures (edges, stars, cliques) are sampled from a latent graph and interleaved with distractors to create dispersed, noisy prompts. Red and blue mark targets; brown marks distractors.
  • Figure 4: Metric trend with GPT-4o and Gemini-2 on edge discovery.
  • Figure 5: Metrics for different models and discovery cases. "Score" corresponds to $(1 - \text{Memory Drift})$, allowing comparison with precision, recall, and F1.
  • ...and 5 more figures