Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval

Xiaojun Wu, Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Hui Xiong, Jia Li, Jian Guo

TL;DR

ToG-3 tackles the bottlenecks of static GraphRAG and LLM-dependent graph extraction by introducing MACER, a Multi-Agent Context Evolution and Retrieval loop, and a Chunk-Triplets-Community heterogeneous graph index. This dual-evolving mechanism adaptively refines both the query and the retrieved subgraph during reasoning, enabling precise evidence gathering even with lightweight LLMs. The approach yields state-of-the-art performance on deep multi-hop benchmarks and robust results on broad-domain tasks, with ablations confirming the critical role of evolving query and graph refinement. Practically, ToG-3 offers a scalable, deployable RAG solution that improves faithfulness and reasoning depth while reducing reliance on large pre-built knowledge graphs.

Abstract

Graph-based Retrieval-Augmented Generation (GraphRAG) has become an important paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing approaches are constrained by their reliance on high-quality knowledge graphs: manually built ones are not scalable, while automatically extracted ones are limited by the performance of LLM extractors, especially when using smaller, locally deployed models. To address this, we introduce Think-on-Graph 3.0 (ToG-3), a novel framework featuring a Multi-Agent Context Evolution and Retrieval (MACER) mechanism. Its core contribution is the dynamic construction and iterative refinement of a Chunk-Triplets-Community heterogeneous graph index, powered by a Dual-Evolution process that adaptively evolves both the query and the retrieved sub-graph during reasoning. ToG-3 dynamically builds a targeted graph index tailored to the query, enabling precise evidence retrieval and reasoning even with lightweight LLMs. Extensive experiments demonstrate that ToG-3 outperforms the compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the efficacy of the components of the MACER framework. The source code is available at https://github.com/DataArcTech/ToG-3.

Paper Structure

This paper contains 52 sections, 13 equations, 17 figures, 8 tables, 2 algorithms.

Figures (17)

  • Figure 1: Performance Limitations of Graph-Based RAG systems under Resource-Constrained and Locally-Deployed Scenarios. In such scenarios, developers typically adopt open-source models such as Llama or Qwen as the backbone LLMs. Limitations like incomplete extracted triplets, insufficient extraction details and parsing failure may lead to insufficient knowledge provision, ultimately resulting in failure to adequately answer the query.
  • Figure 2: Evolution of Retrieval-Augmented Generation Paradigms. (a) Naive RAG embeds raw documents and performs single-shot retrieval. (b) Graph-based RAG pre-builds a static graph once and retrieves from it. (c) ToG-3 introduces a multi-agent loop (Retriever, Constructor, Reflector, Reranker, Responser) in which the graph and the query sub-tasks co-evolve at runtime, yielding dynamic, query-adaptive context that converges to a minimal, sufficient subgraph.
  • Figure 3: Multi-Agent Dual-Evolving Context Retrieval-Response Loop. The Retriever fetches an initial chunk–triplet–community subgraph, and the Reranker reranks it and selects the top-n most relevant pieces of evidence. The Response Agent produces an answer; the Reflector Agent judges sufficiency (reward=1/0). If the context is insufficient (reward=0), the Reflector evolves the query into sub-queries while the Constructor evolves the subgraph (sub-graph refinement). The loop repeats until the context becomes sufficient or the horizon is reached, after which the Response Agent synthesizes the final answer from the full trajectory.
  • Figure 4: Performance comparison of different RAG methods on multi-hop QA datasets. (a) Exact Match scores measure the percentage of questions where the model's answer exactly matches the ground truth. (b) F1 scores provide a harmonic mean of precision and recall for token-level answer matching.
  • Figure 5: ELO-based Pairwise Win Rate Matrices Across Four Benchmark Datasets. Each heatmap visualizes win probabilities derived from direct head-to-head experimental comparisons, transformed through the ELO framework to ensure transitive consistency. The diagonal of the heatmap is set to a default value of 0.5, indicating self-comparison of the method.
  • ...and 12 more figures
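
The control flow described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `macer_loop`, its parameters, and every toy agent below (`retrieve`, `rerank`, `respond`, `reflect`, `construct`) are hypothetical names, and the agents are simple string-matching stand-ins for what would be LLM or retriever calls over a graph index.

```python
from typing import Callable, List, Tuple

# One loop step: (queries issued, reranked context, draft answer, reward).
Step = Tuple[List[str], List[str], str, int]

def macer_loop(
    query: str,
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    respond: Callable[[str, List[str]], str],
    reflect: Callable[[str, List[str], str], Tuple[int, List[str]]],
    construct: Callable[[List[str], List[str]], None],
    horizon: int = 3,
    top_n: int = 5,
) -> Tuple[str, List[Step]]:
    """Hypothetical retrieve -> rerank -> respond -> reflect loop."""
    trajectory: List[Step] = []
    queries = [query]
    pool: List[str] = []  # evidence gathered so far, without duplicates
    for _ in range(horizon):
        for q in queries:
            for evidence in retrieve(q):
                if evidence not in pool:
                    pool.append(evidence)
        context = rerank(query, pool)[:top_n]
        draft = respond(query, context)
        reward, sub_queries = reflect(query, context, draft)
        trajectory.append((list(queries), context, draft, reward))
        if reward == 1:  # context judged sufficient
            break
        queries = sub_queries            # query evolution (Reflector)
        construct(sub_queries, context)  # subgraph evolution (Constructor)
    # Final answer is synthesized from evidence seen across the trajectory.
    merged: List[str] = []
    for _, ctx, _, _ in trajectory:
        merged.extend(e for e in ctx if e not in merged)
    return respond(query, merged), trajectory

# --- Toy stand-ins (assumptions for illustration only) ---
FACTS = [
    "Film X was directed by Alice.",
    "Alice was born in Oslo.",
    "Film Y was directed by Bob.",
]
ENTITIES = ["Film X", "Film Y", "Alice", "Bob"]
refinements: List[List[str]] = []  # records what the Constructor was asked to refine

def retrieve(q: str) -> List[str]:
    mentioned = [e for e in ENTITIES if e in q]
    return [f for f in FACTS if any(e in f for e in mentioned)]

def rerank(q: str, evidence: List[str]) -> List[str]:
    return list(evidence)  # identity ordering in this toy example

def respond(q: str, context: List[str]) -> str:
    for fact in context:
        if "born in " in fact:
            return fact.split("born in ")[1].rstrip(".")
    return "unknown"

def reflect(q: str, context: List[str], answer: str) -> Tuple[int, List[str]]:
    if answer != "unknown":
        return 1, []
    names = [f.split("directed by ")[1].rstrip(".") for f in context if "directed by " in f]
    return 0, [f"Where was {name} born?" for name in names]

def construct(sub_queries: List[str], context: List[str]) -> None:
    refinements.append(list(sub_queries))

answer, trajectory = macer_loop(
    "Where was the director of Film X born?",
    retrieve, rerank, respond, reflect, construct,
)
print(answer, len(trajectory))
```

In this toy run the first pass retrieves only the directing fact, the reflector emits a sub-query about the director, and the second pass finds the birthplace, so the loop stops before the horizon.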
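
The two metrics named in the Figure 4 caption, Exact Match and token-level F1, can be computed roughly as below. The normalization step (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common extractive-QA evaluation practice; the listing above does not say which normalization the paper uses, and the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Oslo", "oslo"))
print(round(token_f1("Oslo Norway", "Oslo"), 3))
```

A dataset-level score is then the mean of these per-question values, which matches the caption's description of Exact Match as a percentage of questions.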