Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch

TL;DR

The paper addresses the plateau observed in single-agent self-improvement of LLMs by introducing a multiagent finetuning framework. It trains a society of models that all start from the same base model, assigns them generation and critic roles, and finetunes each on an independent data subset derived from multiagent debate to foster specialization and diverse reasoning. Empirical results on arithmetic, GSM, and MATH show consistent gains over baselines for both open-source and proprietary models, with improvements persisting over multiple finetuning iterations and transferring zero-shot to new datasets. The approach offers a scalable path to autonomous model improvement that exploits diverse reasoning styles and cross-agent feedback. It costs more compute than single-agent finetuning but applies broadly to existing LLMs.

Abstract

Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.

Paper Structure (38 sections, 5 equations, 14 figures, 7 tables, 2 algorithms)

Figures (14)

  • Figure 1: Multiagent finetuning improves reasoning performance over multiple rounds of finetuning. Our multiagent finetuning procedure enables models to improve across multiple iterations of finetuning. Results reported on the MATH dataset.
  • Figure 2: Overview of Multiagent Finetuning. We first use multiagent debate and majority voting to create the finetuning datasets (left). These datasets are then used to finetune the generation and critic agents (right). When finetuning generation models, we use the majority-voted result ("correct" output) to select first-round responses from each agent. We then finetune critic models using responses from the final round, based on whether responses match the majority-voted result (mix of "correct" and "incorrect" outputs). The finetuned models are combined through multiagent debate to generate more accurate answers. In this figure, we illustrate a single finetuning iteration. Applying multiple finetuning iterations can significantly boost performance.
  • Figure 3: Diversity is preserved and can improve across iterations of finetuning. We measure the response diversity of our method and the single-agent finetuning method on the MATH dataset using two diversity measures. The diversity of our method remains consistent over finetuning iterations for one metric and improves for another metric, whereas the diversity of the single-agent method drops significantly.
  • Figure 4: Relationship between accuracy and diversity. We visualize the relationship between embedding dissimilarity and MATH accuracy across rounds of finetuning. Our multiagent finetuning preserves diversity across rounds of finetuning while improving accuracy.
  • Figure 5: Zero-shot generalization of the proposed method. Our method demonstrates zero-shot generalization capabilities. When trained on the MATH dataset, it can effectively generalize to the GSM dataset. It outperforms all the baselines that are trained on the GSM dataset.
  • ...and 9 more figures
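
The data-construction step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Response` record, the function names, and the choice to take the majority vote over final-round answers are all assumptions made for the example, and the caption does not specify how ties are broken or how critic examples are balanced.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical record layout: (agent_id, response_text, extracted_final_answer).
Response = Tuple[int, str, str]


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def build_generation_data(first_round: List[Response], target: str) -> Dict[int, List[str]]:
    """Per-agent generation data: first-round responses that match the voted answer."""
    data: Dict[int, List[str]] = {}
    for agent_id, text, answer in first_round:
        if answer == target:
            data.setdefault(agent_id, []).append(text)
    return data


def build_critic_data(final_round: List[Response], target: str) -> Dict[int, List[Tuple[str, bool]]]:
    """Per-agent critic data: final-round responses, each flagged as matching or not."""
    data: Dict[int, List[Tuple[str, bool]]] = {}
    for agent_id, text, answer in final_round:
        data.setdefault(agent_id, []).append((text, answer == target))
    return data


# Toy debate on one problem with three agents (made-up responses and answers).
first_round = [(0, "r0", "12"), (1, "r1", "15"), (2, "r2", "12")]
final_round = [(0, "f0", "12"), (1, "f1", "12"), (2, "f2", "15")]

# Assumption: the vote is taken over the final-round answers.
target = majority_vote([answer for _, _, answer in final_round])
generation_data = build_generation_data(first_round, target)
critic_data = build_critic_data(final_round, target)
```

Keeping the datasets keyed by agent mirrors the idea that each model is finetuned on its own independent subset, rather than pooling all responses into one shared training set.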