
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration

Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

LongAgent tackles the challenge of long-context processing by distributing work across a leader and a team of task-specific experts, enabling LLMs to handle inputs well beyond their native context windows. The approach introduces inter-member communication to mitigate hallucinations and coordinates iterative reasoning to extract evidence from documents. Empirical results on Needle-in-a-Haystack PLUS show that a fine-tuned LLaMA-7B implementation of LongAgent can surpass GPT-4 on very long prompts and achieve near-perfect performance on several synthetic tasks, while also offering efficiency advantages through chunked processing. The work also provides a new long-text benchmark and discusses practical limitations and avenues for future enhancement of multi-agent long-text systems.

Abstract

Large language models (LLMs) have demonstrated impressive performance in understanding language and executing complex reasoning tasks. However, LLMs with long context windows are notorious for their expensive training costs and high inference latency. Even the most advanced models, such as GPT-4 and Claude2, often make mistakes when processing inputs of over 100k tokens, a phenomenon also known as "lost in the middle". In this paper, we propose LongAgent, a method based on multi-agent collaboration, which scales LLMs (e.g., LLaMA) to a context of 128K and demonstrates potential superiority in long-text processing compared to GPT-4. In LongAgent, a leader is responsible for understanding user intent and directing team members to acquire information from documents. Because members hallucinate, it is non-trivial for a leader to obtain accurate information from the responses of dozens to hundreds of members. To address this, we develop an inter-member communication mechanism that resolves response conflicts caused by hallucinations through information sharing. Our experimental results indicate that LongAgent offers a promising alternative for long-text processing. The agent team instantiated with LLaMA-7B achieves significant improvements over GPT-4 in tasks such as 128k-long text retrieval and multi-hop question answering.
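The leader/member workflow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the members here are plain keyword-lookup functions rather than fine-tuned LLaMA-7B agents, and every name (`split_into_chunks`, `toy_member`, `hallucinating_member`, `leader_collect`) is hypothetical. The conflict-resolution step assumes one plausible reading of "information sharing": two members with conflicting answers exchange their text and answer again.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical member interface: (context, key) -> answer, or None to decline.
Member = Callable[[str, str], Optional[str]]


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def toy_member(context: str, key: str) -> Optional[str]:
    """Stand-in for an LLM member: return the word following `key`, if any."""
    tokens = context.split()
    for i, token in enumerate(tokens[:-1]):
        if token == key:
            return tokens[i + 1]
    return None


def hallucinating_member(context: str, key: str) -> Optional[str]:
    """A member that invents an answer when its context lacks the key."""
    found = toy_member(context, key)
    return found if found is not None else "made-up"


def leader_collect(document: str, key: str, chunk_size: int,
                   member: Member, max_rounds: int = 100) -> Optional[str]:
    """Assign one chunk per member, then resolve conflicting answers."""
    contexts = split_into_chunks(document, chunk_size)
    answers: Dict[int, Optional[str]] = {}
    for i, context in enumerate(contexts):
        reply = member(context, key)
        if reply is not None:  # members that decline are ignored
            answers[i] = reply

    for _ in range(max_rounds):
        if len(set(answers.values())) <= 1:
            break  # no conflict left
        ids = sorted(answers)
        first = ids[0]
        other = next(i for i in ids if answers[i] != answers[first])
        # Assumed sharing step: both members see each other's text and
        # answer again. Contexts grow here, which a real system with a
        # bounded context window would have to avoid.
        shared = contexts[first] + " " + contexts[other]
        contexts[first] = contexts[other] = shared
        answers[first] = answers[other] = member(shared, key)

    unique = set(answers.values())
    return unique.pop() if len(unique) == 1 else None


if __name__ == "__main__":
    doc = " ".join(["filler"] * 8 + ["passcode", "42"] + ["filler"] * 6)
    print(leader_collect(doc, "passcode", 4, hallucinating_member))
```

In this toy setting three of the four members initially return an invented value, so a simple majority vote would pick the wrong answer; pairwise sharing lets the member that actually holds the evidence correct the others.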

Paper Structure

This paper contains 31 sections, 2 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: LongAgent collaboration scheme. The input long text (left) is segmented into several chunks, each assigned to a corresponding member. The leader receives the user instruction (right), breaks it down into the simplest sub-problems, convenes members for discussion, obtains answers to all sub-problems, and reasons over them to produce the final response.
  • Figure 2: An overview of LongAgent. In step 1, the leader constructs a customized agent team based on the description of the task to be handled. In steps 2 and 3, the leader organizes the team to gather information from documents and resolve conflicts. This process may continue for multiple rounds until the leader deems that enough information has been gathered to generate the final response, which is then output in step 4.
  • Figure 3: Comparison of results on Needle-in-a-Haystack PLUS in the single-document question answering setting. Under the LongAgent scheme, our fine-tuned LLaMA2-7B model achieved an average accuracy improvement of 19.53 percentage points over GPT-4 across the range from 1k to 128k (from 62.00% to 81.53%).
  • Figure 4: Comparison of results on Needle-in-a-Haystack PLUS in the multi-document question answering setting. Under the LongAgent scheme, our fine-tuned LLaMA2-7B model achieved an average accuracy improvement of 4.96 percentage points over GPT-4 across the range from 1k to 128k (from 50.37% to 55.33%).
  • Figure 5: The influence of data recipe on model hallucinations. 'Answer' and 'Reject' represent two types of data. For the former, the documents contain answers to questions; whereas for the latter, they do not.
  • ...and 3 more figures