Training Language Models to Critique With Multi-agent Feedback

Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen

TL;DR

The paper introduces MultiCritique, a multi-agent critique data-generation and RL framework that enhances LLM critique ability without human annotations. By aggregating critiques from four agents, classifying and summarizing them with GPT-4, and applying MARS filtering for RL, the authors construct two datasets (MultiCritiqueDataset-SFT and MultiCritiqueDataset-RL) that significantly improve critique quality over existing datasets. Fine-tuning a 7B model with MultiCritique yields substantial gains, approaching the performance of much larger models such as 70B LLMs and GPT-4 on critique benchmarks. The work demonstrates the scalability and effectiveness of automated, multi-agent critique pipelines for reliable evaluation and improvement of LLMs, with open-source weights and datasets to follow. The paper also discusses ethical considerations; future directions include extending the approach to pairwise comparisons and incorporating more diverse tasks and models.

Abstract

Critique ability, a meta-cognitive capability of humans, is challenging for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique task. Consequently, fine-tuning LLMs on such flawed critiques typically limits the model's performance and propagates these flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by utilizing multi-agent feedback in both the SFT and reinforcement learning (RL) stages. First, our data generation pipeline aggregates high-quality critiques from multiple agents instead of a single model, with crucial information provided as input to simplify the critique task. Furthermore, our pipeline improves the preference accuracy of critique quality through multi-agent feedback, making RL more effective at improving the critique ability of LLMs. Based on our proposed MultiCritique data generation pipeline, we construct the MultiCritiqueDataset for the SFT and RL fine-tuning stages. Extensive experimental results on two benchmarks demonstrate: 1) the superior quality of our constructed SFT dataset compared to existing critique datasets; 2) additional improvements to the critique ability of LLMs brought by the RL stage. Notably, our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models, approaching the performance of advanced 70B LLMs and GPT-4. Code, datasets, and model weights will be publicly available.

Paper Structure

This paper contains 63 sections, 3 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Overview of our proposed MultiCritique data generation pipeline. First, we prepare queries, evaluated responses, and crucial information (Step 1). Then, we run the MultiCritique-SFT pipeline to construct the high-quality SFT critique dataset (Step 2). Finally, we run the MultiCritique-RL pipeline to construct the preference critique dataset for the RL stage (Step 3). An ACU is a structured unit that identifies one specific flaw in the evaluated response. A list of model-generated ACUs constitutes the analytical critique.
  • Figure 2: The correlation between the number of training samples in the SFT dataset (from 1K to 256K) and critique ability. low, medium, high and full represent the models that are trained on critiques in MultiCritiqueDataset-SFT for low-, medium-, high-quality, and all three response qualities (full), respectively.
  • Figure 3: The prompt for generating a task description for the last user query in a conversation.
  • Figure 4: The prompt for generating a reference response given the criteria.
  • Figure 5: The prompt for generating meta-critiques for all the critiques generated by multiple LLMs.
  • ...and 5 more figures
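
As a rough illustration of the ACU notion from the Figure 1 caption, the sketch below represents a critique as a list of ACUs and merges critiques from several agents. It is not the paper's method: the summary above says the pipeline classifies and summarizes critiques with GPT-4, whereas this sketch substitutes a simple agreement threshold on exact flaw text. The `ACU` field names, the `aggregate` function, and the `min_agents` parameter are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ACU:
    """One flaw claimed in the evaluated response (field names are hypothetical)."""
    flaw: str   # short description of the flaw
    agent: str  # name of the critic model that reported it


def aggregate(acus: list[ACU], min_agents: int = 2) -> list[str]:
    """Keep flaws reported by at least `min_agents` distinct agents.

    Assumption: exact string match stands in for the GPT-4
    classification and summarization step described in the source.
    """
    agents_by_flaw: dict[str, set[str]] = defaultdict(set)
    for acu in acus:
        # A set ensures each agent counts once per flaw, even if it repeats itself.
        agents_by_flaw[acu.flaw].add(acu.agent)
    return sorted(
        flaw for flaw, agents in agents_by_flaw.items() if len(agents) >= min_agents
    )


if __name__ == "__main__":
    critiques = [
        ACU("wrong date", "agent_a"),
        ACU("wrong date", "agent_b"),
        ACU("too verbose", "agent_a"),
    ]
    print(aggregate(critiques))
```

Raising `min_agents` trades recall for precision: fewer flaws survive, but each surviving flaw has more independent support.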