Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning

Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, Min Chen

TL;DR

This work addresses instability and distribution collapse in RL fine-tuning of large language models by reframing the process as sequential cooperative multi-agent reinforcement learning (MARL). It introduces CORY, which duplicates the LLM into a pioneer and an observer that coevolve via knowledge transfer and periodic role exchanges, sharing a combined reward and remaining compatible with PPO. Empirical results on IMDB and GSM8K show improved stability and competitive task performance, with a clearer Pareto frontier between reward and policy drift than single-agent RL. The approach is presented as a flexible, algorithm-agnostic plug-and-play solution that can mitigate typical RLHF challenges in real-world fine-tuning.

Abstract

Reinforcement learning (RL) has emerged as a pivotal technique for fine-tuning large language models (LLMs) on specific tasks. However, prevailing RL fine-tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine-tuning of LLMs. In this paper, we propose CORY, extending the RL fine-tuning of LLMs to a sequential cooperative multi-agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi-agent systems. In CORY, the LLM to be fine-tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer's responses. The two agents are trained together. During training, the agents exchange roles periodically, fostering cooperation and coevolution between them. Experiments evaluate CORY's performance by fine-tuning GPT-2 and Llama-2 under subjective and objective reward functions on the IMDB Review and GSM8K datasets, respectively. Results show that CORY outperforms PPO in terms of policy optimality, resistance to distribution collapse, and training robustness, thereby underscoring its potential as a superior methodology for refining LLMs in real-world applications.
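The data flow described in the abstract, where the observer sees both the query and the pioneer's response, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Agent` callable interface, the function name `sequential_rollout`, and the newline separator used to join query and response are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical interface: an agent maps a prompt string to a response string.
Agent = Callable[[str], str]


def sequential_rollout(pioneer: Agent, observer: Agent, query: str,
                       sep: str = "\n") -> Tuple[str, str]:
    """Run one sequential generation step for a single query.

    The pioneer answers from the query alone. The observer answers from
    the query concatenated with the pioneer's response. The concatenation
    format (query + sep + response) is an assumption, not taken from the paper.
    """
    pioneer_response = pioneer(query)
    observer_prompt = query + sep + pioneer_response
    observer_response = observer(observer_prompt)
    return pioneer_response, observer_response
```

In practice both agents would be copies of the same LLM; here any two functions from string to string can stand in for them, which makes the ordering of the two generations easy to inspect.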

Paper Structure

This paper contains 27 sections, 14 equations, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: The framework of CORY. A traditional RL fine-tuning method can be extended to its CORY version in three steps. First, duplicate the LLM into two LLM agents, one acting as a pioneer and the other as an observer; second, combine the task rewards of the two LLM agents to replace the original task reward; third, periodically exchange the roles of the two LLM agents during training. After training, either LLM agent can perform the task independently.
  • Figure 2: The empirical demonstration of why CORY surpasses single-agent RL fine-tuning. In (c), the values of $\eta$ from left to right are 1e-5, 1e-4, 1e-3, and 1e-2.
  • Figure 3: Training curves under subjective rewards on IMDB Review.
  • Figure 4: Training curves under objective rewards on GSM8K.
  • Figure 5: Evaluation results on GSM8K test dataset.
  • ...and 7 more figures
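
The three steps listed in the Figure 1 caption (duplicate, combine rewards, exchange roles) can be sketched as small helper functions. This is a rough illustration under stated assumptions rather than the paper's code: the function names, the use of a plain sum as the combined reward, and the fixed `swap_period` schedule are all hypothetical choices.

```python
import copy
from typing import Any, Tuple


def duplicate(model: Any) -> Tuple[Any, Any]:
    """Step 1: create two independent copies of the model to be fine-tuned."""
    return copy.deepcopy(model), copy.deepcopy(model)


def combined_reward(pioneer_reward: float, observer_reward: float) -> float:
    """Step 2: replace each agent's task reward with one shared reward.

    Summing the two task rewards is an assumption made for this sketch.
    """
    return pioneer_reward + observer_reward


def roles_at_step(step: int, swap_period: int) -> Tuple[int, int]:
    """Step 3: return (pioneer_index, observer_index) for a training step.

    Roles flip every `swap_period` steps; this fixed schedule is hypothetical.
    """
    if swap_period <= 0:
        raise ValueError("swap_period must be positive")
    flipped = (step // swap_period) % 2 == 1
    return (1, 0) if flipped else (0, 1)
```

With `swap_period=2`, agent 0 is the pioneer for steps 0 and 1, agent 1 takes over for steps 2 and 3, and so on. Because both agents receive the same combined reward, any single-agent update rule such as PPO could in principle be applied to each copy unchanged.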