Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, Jiaxuan You

TL;DR

MAE presents a three-role self-evolving RL framework for LLMs by instantiating a Proposer, Solver, and Judge from a single backbone model. It introduces domain-agnostic, self-rewarding signals via Judge-based evaluation, difficulty-aware rewards, and format constraints, trained with Task-Relative REINFORCE++ and synchronous updates. Empirical results on Qwen2.5-3B-Instruct show consistent gains over base and SFT baselines across math, coding, reasoning, and general knowledge tasks, with additional gains when using seed reference questions. The work demonstrates scalable, data-efficient self-improvement without human supervision and points to future extensions with larger backbones and verifiable environments for broader general-domain evolution.

Abstract

Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit its scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, these methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. MAE is built on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and it applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
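
The Proposer, Solver, and Judge interaction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Backbone` type, the `ROLE=...` prompt strings, `parse_score`, `run_round`, the 0-to-1 score range, and the 0.5 neutral fallback are all hypothetical choices made for illustration. The only idea taken from the text is that a single model plays all three roles in turn.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for the shared backbone LLM: prompt in, text out.
Backbone = Callable[[str], str]


@dataclass
class Round:
    question: str
    answer: str
    question_score: float
    answer_score: float


def parse_score(text: str) -> float:
    """Parse a Judge score and clamp it to [0, 1].

    Falls back to an assumed neutral value of 0.5 when the output
    cannot be parsed as a number.
    """
    try:
        return min(1.0, max(0.0, float(text.strip())))
    except ValueError:
        return 0.5


def run_round(model: Backbone) -> Round:
    """Run one Proposer -> Solver -> Judge pass using the same model."""
    question = model("ROLE=proposer\nWrite one new question.")
    answer = model(f"ROLE=solver\nQuestion: {question}")
    q_score = parse_score(
        model(f"ROLE=judge\nRate this question from 0 to 1: {question}")
    )
    a_score = parse_score(
        model(
            f"ROLE=judge\nQuestion: {question}\nAnswer: {answer}\n"
            "Rate the answer from 0 to 1."
        )
    )
    return Round(question, answer, q_score, a_score)


def toy_model(prompt: str) -> str:
    """Deterministic toy backbone used only to exercise the loop."""
    if prompt.startswith("ROLE=proposer"):
        return "What is 2 + 3?"
    if prompt.startswith("ROLE=solver"):
        return "5"
    return "0.9"  # any Judge prompt


if __name__ == "__main__":
    print(run_round(toy_model))
```

In a real training setup the two Judge scores would feed the reward signals for the Proposer and Solver; here they are simply returned so the data flow between roles is visible.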

Paper Structure

This paper contains 48 sections, 11 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the Multi-Agent Evolve Framework. Multi-Agent Evolve instantiates three interactive roles (Proposer, Solver, and Judge) from a single LLM to form a closed self-improving loop. The Proposer generates new questions, the Solver attempts to answer them, and the Judge evaluates both to provide general-domain reward signals. The Judge rewards the Solver for accurate reasoning, while the Proposer receives both a quality reward from the Judge and a difficulty reward that increases when the Solver fails, creating an adversarial co-evolution process that continuously enhances the model’s reasoning ability.
  • Figure 2: Multi-Agent Evolve Framework: (Upper) Multi-Agent Evolve uses the backbone LLM itself as a general evaluator for questions and answers. This brings several benefits, including adaptability to general tasks and increased interaction between agents. (Lower Left) Our framework adapts the quality filtering technique to the Proposer's generation loop, preventing degradation in dataset quality during prolonged training. (Lower Right) Our multi-agent training employs Task-Relative REINFORCE++, which computes the advantage separately for each role and then performs a synchronized parameter update on the shared model.
  • Figure 3: Training Process Analysis: These three figures demonstrate an example training process. (Left) The number of questions in the dataset increases steadily while low-quality questions are excluded. (Mid and Right) The Proposer learns to generate questions that present a desirable level of difficulty to the Solver, thereby benefiting the model in future training.
  • Figure 4: Examples of Applying Format Reward and Question Quality Filtering. Examples shown in green demonstrate generations that can be correctly extracted when these two techniques are applied, which helps maintain our dataset and the training process. The red examples show typical errors that detract from the training process by introducing incorrect questions or by frequently causing the reward to fall back to a neutral value.
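
The reward structure in the Figure 1 caption and the per-role advantage computation in the Figure 2 caption can be sketched as follows. The exact formulas are not given in this summary, so every functional form here is an assumption: the difficulty term is taken as one minus the Solver's mean Judge score, the Proposer reward is a weighted mix controlled by a hypothetical `w_quality`, and the per-role advantage is a plain mean/standard-deviation normalization within each role. The function names are illustrative only.

```python
from statistics import mean, pstdev


def difficulty_reward(solver_scores):
    """Assumed form: grows as the Solver's average Judge score drops."""
    return 1.0 - mean(solver_scores)


def proposer_reward(quality, solver_scores, w_quality=0.5):
    """Assumed weighted mix of the Judge's quality score and difficulty."""
    return (
        w_quality * quality
        + (1.0 - w_quality) * difficulty_reward(solver_scores)
    )


def per_role_advantages(rewards_by_role, eps=1e-8):
    """Standardize rewards within each role.

    Each role gets its own baseline (mean) and scale (population std),
    so rewards from one role do not shift the advantages of another.
    """
    advantages = {}
    for role, rewards in rewards_by_role.items():
        mu = mean(rewards)
        sigma = pstdev(rewards)
        advantages[role] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages


if __name__ == "__main__":
    batch = {
        "proposer": [proposer_reward(0.8, [0.0, 0.5]), proposer_reward(0.6, [1.0, 1.0])],
        "solver": [0.0, 0.5, 1.0, 1.0],
        "judge": [1.0, 1.0],
    }
    for role, adv in per_role_advantages(batch).items():
        print(role, [round(a, 3) for a in adv])
```

After normalization, the advantages from all three roles would be pooled into one gradient step on the shared model, which is how the synchronized update in the caption is read here.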