GraphCogent: Mitigating LLMs' Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding

Rongzheng Wang, Shuang Liang, Qizhi Chen, Yihong Huang, Muquan Li, Yizhuo Ma, Dongyang Zhang, Ke Qin, Man-Fai Leung

TL;DR

GraphCogent addresses the memory bottlenecks of LLMs in real-world graph reasoning by adopting a cognitive-inspired sensory-buffer-execution architecture and introducing Graph4real, a large-scale, domain-diverse benchmark. The framework decomposes graph reasoning into perception (Sensory), integration (Buffer), and action (Execution) with a hybrid approach of tool-calling and tool-creation to manage diverse representations and dynamic tasks. Key contributions include a Graph N-back memory test, a Graph Verifier to ensure transformation reliability, a cross-format Buffer Module, and a two-stage Execution Agent with CMPO-guided tool discrimination and a Tool Creator for on-demand tool synthesis. Experimental results demonstrate robust, scalable performance across four real-world domains, with high accuracy, substantial token savings, and strong cross-dataset generalization, highlighting practical impact for robust graph reasoning in real-world settings.

Abstract

Large language models (LLMs) show promising performance on small-scale graph reasoning tasks but fail when handling real-world graphs with complex queries. This failure arises from LLMs' working memory constraints, which leave them unable to retain long-range graph topology over extended contexts while sustaining coherent multi-step reasoning. Real-world graphs such as Web, Transportation, Social, and Citation networks, however, are often structurally complex. To address these limitations, we propose GraphCogent, a collaborative agent framework inspired by the human Working Memory Model that decomposes graph reasoning into specialized cognitive processes: sense, buffer, and execute. The framework consists of three modules: the Sensory Module standardizes diverse graph text representations via subgraph sampling, the Buffer Module integrates and indexes graph data across multiple formats, and the Execution Module combines tool calling and tool creation for efficient reasoning. We also introduce Graph4real, a comprehensive benchmark that contains four domains of real-world graphs (Web, Transportation, Social, and Citation) to evaluate LLMs' graph reasoning capabilities. Graph4real covers 21 different graph reasoning tasks, categorized into three types (Structural Querying, Algorithmic Reasoning, and Predictive Modeling tasks), with graph scales up to 10 times larger than existing benchmarks. Experiments show that GraphCogent built on Llama3.1-8B achieves a 50% improvement over massive-scale LLMs like DeepSeek-R1 (671B). Compared to the state-of-the-art agent-based baseline, our framework improves accuracy by 20% while reducing token usage by 80% for in-toolset tasks and 30% for out-toolset tasks. Code will be available after review.
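
As a rough illustration of the sense, buffer, and execute decomposition described in the abstract, the sketch below wires three plain Python functions together. It is not the authors' implementation: the function names, the whitespace-separated edge-list input, the `TOOLSET` dictionary, and the `create_tool` callback are all hypothetical, and the hand-written callback merely stands in for the paper's LLM-based Tool Creator.

```python
from collections import defaultdict

def sense(graph_text, chunk_size=3):
    """Sensory step (assumed form): parse 'u v' lines in small chunks,
    a stand-in for standardizing graph text via subgraph sampling."""
    lines = [ln for ln in graph_text.strip().splitlines() if ln.strip()]
    for i in range(0, len(lines), chunk_size):
        yield [tuple(ln.split()[:2]) for ln in lines[i:i + chunk_size]]

def buffer(chunks):
    """Buffer step (assumed form): merge all chunks into one adjacency index."""
    adj = defaultdict(set)
    for chunk in chunks:
        for u, v in chunk:
            adj[u].add(v)
            adj[v].add(u)  # assumption: the graph is undirected
    return adj

def degree(adj, node):
    return len(adj.get(node, set()))

def neighbors(adj, node):
    return sorted(adj.get(node, set()))

# Hypothetical common toolset; the real toolset is not specified here.
TOOLSET = {"degree": degree, "neighbors": neighbors}

def execute(adj, task, *args, create_tool=None):
    """Execution step: call an in-toolset tool, otherwise ask a tool creator."""
    if task in TOOLSET:
        return TOOLSET[task](adj, *args)
    if create_tool is None:
        raise KeyError(f"no tool for task {task!r} and no tool creator given")
    tool = create_tool(task)
    TOOLSET[task] = tool  # backfill the new tool into the common toolset
    return tool(adj, *args)

graph_text = """
a b
a c
b c
c d
"""
adj = buffer(sense(graph_text))
in_toolset_answer = execute(adj, "degree", "c")

# Placeholder tool creator: returns a hand-written edge counter.
def toy_creator(task):
    return lambda g: sum(len(nbrs) for nbrs in g.values()) // 2

out_toolset_answer = execute(adj, "edge_count", create_tool=toy_creator)
print(in_toolset_answer, out_toolset_answer)
```

The point of the split is that the reasoning step never sees raw graph text: it only queries the buffered structure, either through an existing tool or through one created on demand.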

Paper Structure

This paper contains 27 sections, 11 equations, 5 figures, 15 tables, 2 algorithms.

Figures (5)

  • Figure 1: Graph N-back Query Task: A graph is split into 50-edge subsets $E_t$. At turn $t+N$, the LLM verifies edge existence in $E_t$. Experimental results are in Section 3.1.
  • Figure 2: Graph N-back Query. Accuracy (right y-axis) measures memory retention across dialogues and is computed from the sum of True edges (correctly identified existing edges) and False edges (correctly rejected non-existent edges), both shown on the left y-axis.
  • Figure 4: Overview of GraphCogent. Sensory Module (left) standardizes various graph text representations through subgraph sampling and conversion; Buffer Module (center) builds cross-format graph data (e.g., NetworkX) through integration and indexing transformations; Execution Module (right) supports two reasoning modes: the Execution Agent performs tool discrimination and tool calling for in-toolset tasks, while the Tool Creator handles out-toolset tasks through tool creation.
  • Figure 5: Overview of Execution Agent and Tool Creator training. In-toolset task (top) initializes a thinking policy from Think and Tool pairs, applies a CMPO method to refine Execution Agent’s tool capability discrimination; Out-toolset task (bottom) uses the fine-tuned Tool Creator to synthesize minimal task-specific tools, and backfills them into the common toolset.
  • Figure 6: Average Performance on Public Benchmarks (right). Code execution rate on out-toolset tasks (left).
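
The Graph N-back setup in Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the helper names, the query dictionary layout, the number of queries per turn, the undirected treatment of edges, and the choice to draw negative queries from pairs absent from the whole graph are all hypothetical. Accuracy is taken here as correct answers divided by total queries, which is one natural reading of the Figure 2 caption.

```python
import random

def split_edges(edges, chunk_size=50):
    """Split an edge list into consecutive subsets E_0, E_1, ... of chunk_size edges."""
    return [edges[i:i + chunk_size] for i in range(0, len(edges), chunk_size)]

def nback_queries(subsets, n, nodes, per_turn=4, seed=0):
    """Build queries about E_t that are asked at turn t + n.

    Half of the queries per subset are existing edges, half are node pairs
    that do not occur anywhere in the graph (assumes a sparse graph, so
    rejection sampling terminates quickly).
    """
    rng = random.Random(seed)
    present = {frozenset(e) for subset in subsets for e in subset}
    queries = []
    for t, subset in enumerate(subsets):
        positives = rng.sample(subset, min(per_turn // 2, len(subset)))
        negatives = []
        while len(negatives) < len(positives):
            u, v = rng.sample(nodes, 2)
            if frozenset((u, v)) not in present:
                negatives.append((u, v))
        for edge, label in [(e, True) for e in positives] + [(e, False) for e in negatives]:
            queries.append({"source_turn": t, "ask_turn": t + n, "edge": edge, "label": label})
    return queries

def nback_accuracy(queries, answers):
    """Correctly confirmed existing edges plus correctly rejected non-existent edges."""
    true_hits = sum(1 for q, a in zip(queries, answers) if q["label"] and a)
    false_hits = sum(1 for q, a in zip(queries, answers) if not q["label"] and not a)
    return (true_hits + false_hits) / len(queries)

# Toy graph: a 200-node ring, giving 200 edges and four 50-edge subsets.
nodes = list(range(200))
edges = [(i, (i + 1) % 200) for i in nodes]
subsets = split_edges(edges)
queries = nback_queries(subsets, n=2, nodes=nodes)

oracle = [q["label"] for q in queries]   # a model with perfect recall
always_yes = [True] * len(queries)       # a model that confirms every edge
print(nback_accuracy(queries, oracle), nback_accuracy(queries, always_yes))
```

In an actual run, each answer would come from the LLM after it has seen the intervening subsets, so accuracy dropping as N grows would indicate weaker retention of earlier edges.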