CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, Lei Bai

TL;DR

CoMAS tackles autonomous self-evolution of LLM-based agents by enabling learning purely from inter-agent interactions, without external rewards. It combines three components (interaction-generated data, an LLM-as-a-judge intrinsic reward mechanism, and RL-based policy updates using REINFORCE++) to drive decentralized co-evolution. Across math, coding, and science benchmarks, CoMAS consistently improves over untrained agents and achieves state-of-the-art performance in most evaluation settings. Ablations confirm that interaction-based rewards are necessary and show promising scalability as the number and diversity of agents increase. The framework points to a scalable, decentralized path for autonomous multi-agent learning that mirrors human collaborative problem solving while reducing reliance on external verifiers or reward models.

Abstract

Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.

Paper Structure

This paper contains 25 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A comparison of our proposed CoMAS framework with existing RL-based self-evolution methods. The left column outlines methods utilizing external rewards from verifiers or reward models. The middle column outlines methods leveraging intrinsic rewards from metrics such as self-certainty, confidence, semantic entropy, and pseudo-labels from majority voting. The right column outlines our CoMAS framework, which derives rewards from multi-agent interactions.
  • Figure 2: An overview of our proposed CoMAS pipeline. CoMAS is built upon a flexible and interactive multi-agent workflow composed of three core components: interaction, reward formulation, and policy optimization. For a given question, the agents conduct a discussion by contributing solutions, evaluating the existing solutions, and scoring solutions based on those evaluations. The scores are extracted and transformed into rewards for the corresponding solutions and evaluations. All generated experiences are collected to train the policies of the agents.
  • Figure 3: Training dynamics of CoMAS. The left figure shows the curve of the average response length of each agent during training. The right figure shows the curve of the average normalized reward of each agent during training. These trends together indicate that CoMAS achieves a stable and effective training process that improves the capabilities of agents.
  • Figure 4: Results of the ablation study for reward formulation. The left figure compares the performance across the original CoMAS and two variants (without evaluation and without scoring). The right figure shows the average normalized rewards during the training process. These results indicate that the adversarial reward design is of key importance for the success of CoMAS.
  • Figure 5: Results of the ablation study for framework scalability. The left figure shows how the number of agents affects the performance of CoMAS across different setups. The right figure compares the performance between homogeneous and heterogeneous agent settings. These results demonstrate the underlying scalability of CoMAS with the number and diversity of agents.
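
The reward-formulation step described in the Figure 2 caption (scores are extracted and turned into rewards for both the solution and its evaluation) can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Score: N` output format, the 1 to 3 score range, the linear normalization, and the complementary (zero-sum style) split between solution and evaluation, which is only one possible reading of the "adversarial reward design" mentioned in the Figure 4 caption. The names `Experience`, `extract_score`, and `formulate_rewards` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed judge output format, e.g. "Score: 2". The real format may differ.
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*(\d+)", re.IGNORECASE)


@dataclass
class Experience:
    role: str      # "solution" or "evaluation"
    text: str      # the agent response that receives the reward
    reward: float  # reward in [0, 1]


def extract_score(judge_text: str, lo: int = 1, hi: int = 3) -> Optional[int]:
    """Return the first integer score in [lo, hi], or None if absent or out of range."""
    match = SCORE_PATTERN.search(judge_text)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def formulate_rewards(solution: str, evaluation: str, judge_text: str,
                      lo: int = 1, hi: int = 3) -> List[Experience]:
    """Map a judge score to complementary rewards (an illustrative assumption).

    A high score rewards the solution; a low score rewards the evaluation
    that criticized it. Unparseable judge outputs yield no experiences.
    """
    score = extract_score(judge_text, lo, hi)
    if score is None:
        return []
    solution_reward = (score - lo) / (hi - lo)
    return [
        Experience("solution", solution, solution_reward),
        Experience("evaluation", evaluation, 1.0 - solution_reward),
    ]


if __name__ == "__main__":
    for exp in formulate_rewards("x = 2", "The sign is wrong.", "Score: 1"):
        print(exp.role, exp.reward)
```

In this sketch a judge output without a valid score produces no training experience at all; whether CoMAS discards or penalizes such cases is not stated in the text above, so treat that choice as a placeholder.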