DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs
Yuanhao Li, Mingshan Liu, Hongbo Wang, Yiding Zhang, Yifei Ma, Wei Tan
TL;DR
DRAFT-RL addresses two weaknesses of LLM-based multi-agent RL, reliance on single-path reasoning and slow convergence, by integrating Chain-of-Draft (CoD) reasoning with a three-part mechanism: multi-draft generation, peer-guided evaluation, and reward-aligned selection. The framework enables explicit multi-path exploration, collaborative critique, and RL-guided refinement, and it improves performance on code synthesis, symbolic mathematics, and knowledge-intensive QA, with notable gains on the MATH dataset. Its key contributions are CoD-style concise drafting in a multi-agent RL setting, a peer-evaluation mechanism, and a learned reward model that unifies exploration with task rewards, which together yield faster convergence and more interpretable agent behavior. In practice, the approach improves accuracy, efficiency, and robustness on complex reasoning tasks and offers insights into emergent agent specialization and cross-domain transfer.
Abstract
Large Language Models (LLMs) have shown impressive capabilities in multi-step reasoning and problem-solving. Recent works introduce multi-agent reflection frameworks in which multiple LLM agents critique and refine each other's outputs using reinforcement learning (RL). However, these approaches often rely on single-shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT-RL, a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Instead of generating a single response, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. The selected drafts are used to refine future reasoning strategies through actor-critic learning. DRAFT-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA, demonstrating that DRAFT-RL outperforms existing reflective and RL-based agents by significant margins in both accuracy and convergence speed.
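The draft-selection step described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Draft` type, the `select_draft` function, the `peer_weight` parameter, and in particular the choice to blend peer scores and the reward-model score as a weighted average. The abstract states only that peers and a learned reward model jointly identify the most promising draft; it does not give the combination rule. The actor-critic update that consumes the selected draft is omitted.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple


@dataclass
class Draft:
    agent_id: int  # which agent produced this draft
    text: str      # the concise reasoning draft itself


# A scorer maps a draft to a real-valued score (higher is better).
Scorer = Callable[[Draft], float]


def select_draft(
    drafts: Sequence[Draft],
    peer_scorers: Sequence[Scorer],
    reward_model: Scorer,
    peer_weight: float = 0.5,  # assumed blending weight, not from the paper
) -> Tuple[Draft, float]:
    """Return the draft with the highest blended score, and that score."""
    if not drafts:
        raise ValueError("at least one draft is required")
    if not peer_scorers:
        raise ValueError("at least one peer scorer is required")

    best_draft, best_score = drafts[0], float("-inf")
    for draft in drafts:
        # Average the peer evaluations, then blend with the reward model.
        peer_score = mean([scorer(draft) for scorer in peer_scorers])
        score = peer_weight * peer_score + (1.0 - peer_weight) * reward_model(draft)
        if score > best_score:
            best_draft, best_score = draft, score
    return best_draft, best_score


if __name__ == "__main__":
    # Toy stand-ins for LLM peers and a learned reward model.
    candidates = [Draft(0, "a"), Draft(1, "bbb"), Draft(2, "bb")]
    peers = [lambda d: float(len(d.text)), lambda d: 1.0]
    reward = lambda d: 1.0 if d.agent_id == 2 else 0.0
    chosen, score = select_draft(candidates, peers, reward)
    print(chosen.agent_id, score)
```

In this toy setup the peers alone would favor the longest draft, while the reward model favors agent 2; the blended score is what decides the outcome, which mirrors the idea of unifying peer feedback with a task-aligned reward signal.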
