Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian

TL;DR

TOA tackles long-context challenges in LLMs by having multiple agents read document chunks along different tree-structured paths, which fosters multi-perspective reasoning. The framework comprises three phases (chunk perception, multi-perspective understanding, and consensus formation), augmented by prefix-hash caching and adaptive pruning to reduce redundancy and compute. Empirical results on DetectiveQA and NovelQA show that TOA with a lightweight 8B model outperforms several baselines and rivals larger commercial models on long-context tasks, while maintaining a low none-rate. The findings suggest that varying the reading order and collaborating across agents can mitigate position bias and hallucinations, enabling efficient and robust long-context understanding with practical deployment implications.

Abstract

Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the "lost in the middle" issue, where information located in the middle of a long input tends to be underutilized. Existing methods that reduce the input risk discarding key information, while methods that extend the context window often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, and the agents then dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by the compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates performance comparable to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.
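
The abstract names prefix-hash caching as one of the efficiency measures but does not spell out its mechanics here. The sketch below is one plausible reading, not the authors' implementation: when several reading orders share the same ordered prefix of chunks, the result for that prefix is computed once and reused. The names `prefix_key`, `PrefixCache`, and `toy_read` are hypothetical, and `toy_read` merely stands in for an LLM call.

```python
import hashlib
from itertools import permutations


def prefix_key(prefix):
    """Hash an ordered tuple of chunk ids into a fixed-length cache key."""
    joined = "|".join(str(chunk_id) for chunk_id in prefix)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class PrefixCache:
    """Memoizes the reading state reached after each ordered prefix of chunks."""

    def __init__(self, read_fn):
        self.read_fn = read_fn   # hypothetical stand-in for one LLM call
        self.store = {}          # prefix hash -> state after reading that prefix
        self.calls = 0           # number of times read_fn was actually invoked

    def read_path(self, path):
        state = ""
        for end in range(1, len(path) + 1):
            key = prefix_key(path[:end])
            if key not in self.store:
                # Cache miss: read the next chunk on top of the cached state.
                self.calls += 1
                self.store[key] = self.read_fn(state, path[end - 1])
            state = self.store[key]
        return state


def toy_read(state, chunk_id):
    """Toy reader (assumption): appends the chunk id to the running state."""
    return state + f"[{chunk_id}]"


cache = PrefixCache(toy_read)
paths = list(permutations(range(3)))   # 6 reading orders over 3 chunks
results = [cache.read_path(p) for p in paths]
```

In this toy setting, reading all six orders without a cache would take 6 × 3 = 18 reader calls. With the cache, each distinct non-empty prefix is read once, so `cache.calls` is 3 + 6 + 6 = 15. The saving grows when many paths share long prefixes.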

Paper Structure

This paper contains 29 sections, 7 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: The core difference between TOA and other multi-agent reasoning methods. (a) COA processes chunks sequentially, with a final manager agent making the decision. (b) LONGAGENT uses a leader agent to coordinate multi-turn discussions with the other agents. (c) TOA probes multiple paths in a tree structure to prompt multi-perspective reasoning.
  • Figure 2: An overview of TOA. In Phase 1, the document is split into chunks; each agent processes one chunk and produces its cognition, which is stored in $\mathcal{M}$. In Phase 2, agents exchange cognition, express interest in reading additional chunks, and probe those chunks in different orders. In Phase 3, each agent generates a local answer, and the final answer is determined by majority voting.
  • Figure 3: Needle-in-a-Haystack Single-Needle QA results. Using the same base model, LLaMA3.1-8B, TOA improves performance over the baselines by more than 50% in the best case as the haystack length varies from 1k to 128k. The percentage on the y-axis is the depth of the needle. The bold black number in each subfigure is the average score.
  • Figure 4: Needle-in-a-Haystack Multi-Needle QA results. Using the same base model, LLaMA3.1-8B, TOA improves performance over the baselines by more than 100% in the best case as the haystack length varies from 1k to 128k. The two percentages on the y-axis are the depths of Needle 1 and Needle 2, respectively. The bold black number in each subfigure is the average score.
  • Figure 5: Performance of COA and TOA on the DetectiveQA dataset. TOA is more robust to longer inputs.
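
The Figure 2 caption describes a document being split into chunks in Phase 1 and a final answer chosen by majority voting in Phase 3. The sketch below illustrates only those two steps. It rests on two assumptions: fixed-size character chunking and a first-seen tie-breaking rule, neither of which is stated in the caption. The function names are hypothetical.

```python
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Phase 1 (sketch): cut the document into fixed-size character chunks.

    Fixed-size character chunking is an assumption made for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def majority_vote(local_answers):
    """Phase 3 (sketch): return the most frequent local answer.

    Ties go to the answer that appears first in the list; this
    tie-breaking rule is an assumption, not taken from the paper.
    """
    if not local_answers:
        raise ValueError("need at least one local answer")
    counts = Counter(local_answers)
    best = max(counts.values())
    for answer in local_answers:
        if counts[answer] == best:
            return answer


chunks = split_into_chunks("abcdefghij", 4)        # three chunks, the last one shorter
final_answer = majority_vote(["B", "A", "B", "C"])  # "B" has two of the four votes
```

Phase 2, in which agents exchange cognition and read further chunks in different orders, is deliberately left out because the caption does not give enough detail to sketch it faithfully.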