ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks

Heng Zhou, Hejia Geng, Xiangyuan Xue, Li Kang, Yiran Qin, Zhiyong Wang, Zhenfei Yin, Lei Bai

TL;DR

This work tackles the scalability and optimization challenges of reasoning with large language models by introducing ReSo, a reward-driven, self-organizing multi-agent system. It combines task graph generation with a dynamic agent selection process guided by a Collaborative Reward Model (CRM) and an automated data-synthesis framework to create MAS benchmarks without human annotations. Key innovations include a Dynamic Agent Database, a two-stage agent search (coarse via UCB and fine-grained via CRM), and an MCTS-inspired perspective to efficiently navigate task graphs. Empirical results show ReSo matching or surpassing state-of-the-art methods on Math-MAS and SciBench-MAS, with strong generalization on standard benchmarks, and thorough ablations confirming the value of task decomposition, agent selection, and reward signaling. The approach demonstrates scalable, data-driven optimization of MAS cooperation, with open-source code and data to enable broader adoption and cross-domain application.

Abstract

Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving; however, current MAS frameworks suffer from poor flexibility and scalability with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process centered on our Collaborative Reward Model that provides fine-grained reward signals to optimize MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without any human annotations. Experimental results show that ReSo matches or outperforms existing methods, achieving 33.7 percent accuracy on Math-MAS and 32.3 percent accuracy on SciBench-MAS, where other approaches completely fail.
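The two-stage selection described above (a coarse UCB search followed by fine-grained reward scoring) can be illustrated with a minimal sketch. It is not the authors' implementation: the `AgentStats` record, the `reward_fn` callback standing in for the Collaborative Reward Model, and the use of the textbook UCB1 formula with exploration constant `c` are all assumptions made for illustration, and the paper's exact scoring rule may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentStats:
    """Running statistics for one agent (hypothetical record layout)."""
    name: str
    pulls: int = 0
    total_reward: float = 0.0

    @property
    def mean_reward(self) -> float:
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb_score(agent: AgentStats, total_pulls: int, c: float = 1.4) -> float:
    """Textbook UCB1 score (an assumption; the paper's formula may differ).

    Untried agents score +inf so that they are explored first.
    """
    if agent.pulls == 0:
        return math.inf
    bonus = c * math.sqrt(math.log(total_pulls) / agent.pulls)
    return agent.mean_reward + bonus


def coarse_search(agents, k, c=1.4):
    """Stage 1: keep the k agents with the highest UCB score."""
    total = max(sum(a.pulls for a in agents), 1)
    ranked = sorted(agents, key=lambda a: ucb_score(a, total, c), reverse=True)
    return ranked[:k]


def select_and_update(agents, k, reward_fn):
    """Stage 2: score each candidate with reward_fn, update stats, return the best.

    reward_fn is a hypothetical stand-in for a reward model: it takes an
    AgentStats and returns a float reward for that agent's answer.
    """
    candidates = coarse_search(agents, k)
    rewards = {a.name: reward_fn(a) for a in candidates}
    for a in candidates:
        a.pulls += 1
        a.total_reward += rewards[a.name]
    best = max(candidates, key=lambda a: rewards[a.name])
    return best, rewards
```

The coarse stage keeps the number of reward-model calls at `k` per subtask regardless of how many agents are registered, which is the usual motivation for splitting the search into two stages.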

Paper Structure

This paper contains 54 sections, 10 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of the ReSo pipeline. ReSo first decomposes the task into a DAG and then constructs an agent graph via topological sorting. For each subtask node, it searches the dynamic agent database (DADB) for agent candidates, then uses the Collaborative Reward Model (CRM) to choose the best agent and update the agent estimates in DADB.
  • Figure 2: Illustration of our proposed ReSo. (a) We decompose the question into a subtask DAG. (b) Training of ReSo: we first use the UCB score to perform a coarse search in DADB and select the top-k agents, then score their inference results with CRM and update DADB with the resulting rewards. This process is repeated for each node of the DAG in topological order. (c) Testing of ReSo: we select the best agent from DADB.
  • Figure 3: Results of ablation studies. (a) Fine-tuning on domain-specific training data significantly improves decomposition quality, thus enhancing overall system performance. (b) Our robust agent selection strategy within the MAS contributes significantly to performance. (c) Compared with general reward models, our fine-tuned reward model is more task-specific and provides more precise reward signals, thus improving system performance.
  • Figure 4: Training Curve of ReSo.
  • Figure 5: Performance of different models on our selected Math and SciBench dataset subproblems.
  • ...and 7 more figures
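
The captions for Figures 1 and 2 describe visiting the subtask DAG node by node in topological order, so that each subtask is handled only after the subtasks it depends on. A minimal sketch of that traversal, using Python's standard `graphlib` module, follows. The `run_subtasks` function, the dictionary layout of `dag`, and the `solve` callback (a placeholder for whatever agent handles a node) are hypothetical names chosen for illustration, not part of ReSo.

```python
from graphlib import TopologicalSorter


def run_subtasks(dag, solve):
    """Solve every subtask after all of its dependencies.

    dag maps each subtask name to the set of subtasks it depends on.
    solve(node, upstream) is a hypothetical callback that receives the
    node name and a dict of its dependencies' results.
    """
    results = {}
    for node in TopologicalSorter(dag).static_order():
        # Dependencies always appear earlier in static_order(),
        # so their results are already available here.
        upstream = {dep: results[dep] for dep in dag.get(node, ())}
        results[node] = solve(node, upstream)
    return results


# Example: s3 needs s1 and s2, and s2 needs s1.
example_dag = {"s1": set(), "s2": {"s1"}, "s3": {"s1", "s2"}}
answers = run_subtasks(example_dag, lambda node, upstream: f"{node} used {sorted(upstream)}")
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependency graph contains a cycle, which is a convenient check that a generated decomposition really is a DAG.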