Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, Tong Che

TL;DR

The paper tackles the challenge of scaling collaborative reasoning in multi-agent LLM systems by building an adaptive test-time scaling framework. It introduces M500, a 500-trace MAS dataset, and fine-tunes Qwen2.5-32B-Instruct to create M1-32B for enhanced multi-agent collaboration, complemented by a CEO agent that dynamically manages discussion and resources. Through open-source evaluation in AgentVerse across general understanding, mathematical reasoning, and coding, the approach yields substantial gains over strong baselines and approaches state-of-the-art performance on several tasks. The work demonstrates the value of learned collaboration and adaptive coordination in MAS and provides reproducible data and code to advance research in this area.

Abstract

Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks that single-agent systems often struggle to manage. While recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. We construct M500, a high-quality dataset containing 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on this dataset to produce M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem-solving. Evaluated in an open-source MAS across a range of tasks, including general understanding, mathematical reasoning, and coding, our system significantly outperforms strong baselines. For instance, M1-32B achieves a 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching the performance of state-of-the-art models like DeepSeek-R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi-agent reasoning. Code is available at https://github.com/jincan333/MAS-TTS
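
To make the idea of adaptive coordination concrete, the following is a minimal sketch of a control loop in which a coordinating agent decides, round by round, whether to stop, how many agents to involve, and how deeply they should reason. It is not the authors' implementation: `CEODecision`, `run_discussion`, `toy_ceo`, and `toy_solver` are hypothetical names, and the stopping rule (two consecutive identical answers) is an assumption chosen only for illustration. In a real system the two callables would wrap LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CEODecision:
    """Hypothetical container for one coordination decision."""
    stop: bool            # end the discussion now?
    num_agents: int       # agents to involve in the next round
    reasoning_depth: str  # e.g. "low", "medium", "high"


def run_discussion(
    task: str,
    ceo: Callable[[str, List[str]], CEODecision],
    solve_round: Callable[[str, int, str], str],
    max_rounds: int = 5,
) -> Optional[str]:
    """Run rounds until the CEO stops or the round budget is spent."""
    history: List[str] = []
    for _ in range(max_rounds):
        decision = ceo(task, history)
        if decision.stop:
            break
        answer = solve_round(task, decision.num_agents, decision.reasoning_depth)
        history.append(answer)
    # Return the most recent answer, or None if no round was run.
    return history[-1] if history else None


def toy_ceo(task: str, history: List[str]) -> CEODecision:
    """Toy policy (assumption): stop once the last two answers agree,
    otherwise add one agent and raise the reasoning depth."""
    if len(history) >= 2 and history[-1] == history[-2]:
        return CEODecision(stop=True, num_agents=0, reasoning_depth="low")
    depth = ["low", "medium", "high"][min(len(history), 2)]
    return CEODecision(stop=False, num_agents=2 + len(history), reasoning_depth=depth)


def toy_solver(task: str, num_agents: int, depth: str) -> str:
    """Stand-in for the LLM agents: the answer changes once depth exceeds 'low'."""
    return "draft" if depth == "low" else "final"
```

Calling `run_discussion("some question", toy_ceo, toy_solver)` runs three rounds with the toy policy and then stops because the last two answers agree. Keeping the coordinator as a plain callable separates the coordination policy from the agents that produce answers, so either side can be swapped without touching the loop.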

Paper Structure

This paper contains 26 sections, 35 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: An illustration of a multi-agent collaborative reasoning data sample from M500.
  • Figure 2: Distributions of key statistics in M500: question category (filtered with count $>10$), predicted number of experts required for solving each problem, and solution token usage.
  • Figure 3: Overview of integrating the CEO agent into an existing MAS, using AgentVerse as an example. The CEO agent adaptively scales collaboration and reasoning by adjusting the number of agents, termination conditions, and reasoning depth.
  • Figure 4: An "aha" moment in MAS where the CEO agent proactively verifies and corrects the solution provided by the Problem Solver. After identifying an error, the CEO suggests a corrected approach, which the Problem Solver then incorporates into its revised solution.
  • Figure 5: The effect of scaling collaboration in AgentVerse with M1-32B by increasing the number of total iterations, the number of critic iterations, and the total number of agents involved in the MAS.
  • ...and 3 more figures