ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration

Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, Yang Liu

TL;DR

ACC-Collab introduces a learned two-agent framework where an actor and a critic are jointly trained to collaboratively solve tasks via iterative dialogue. The method leverages Partial Trajectory Rewards and Guided-Collaborative Trajectories with Direct Preference Optimization to create high-quality training data and robust collaboration strategies. Across BoolQ, MMLU, BBH, SCIQ, and ARC, ACC-Collab outperforms state-of-the-art baselines, often with a single round of training, while the critic evolves to provide more informative disagreement. These findings suggest that explicitly training collaboration between LLM agents can yield substantial improvements in multi-turn reasoning and task solving, with potential for broader applicability and future scaling.

Abstract

Large language models (LLMs) have demonstrated a remarkable ability to serve as general-purpose tools for various language-based tasks. Recent works have demonstrated that the efficacy of such models can be improved through iterative dialog between multiple models. While these paradigms show promise in improving model efficacy, most works in this area treat collaboration as an emergent behavior, rather than a learned behavior. In doing so, current multi-agent frameworks rely on collaborative behaviors to have been sufficiently trained into off-the-shelf models. To address this limitation, we propose ACC-Collab, an Actor-Critic based learning framework to produce a two-agent team (an actor-agent and a critic-agent) specialized in collaboration. We demonstrate that ACC-Collab outperforms SotA multi-agent techniques on a wide array of benchmarks.

Paper Structure

This paper contains 36 sections, 9 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: ACC-Collab training pipeline, exemplified for the actor. 1) We generate data from both natural deliberation and guided deliberation, steered towards and away from the ground truth answer $y$, using the actor and critic. 2) We compute the relative quality of each trajectory based on the expected quality difference $\Delta_{y}, \Delta_{!y}$ w.r.t. the natural response. 3) We store all high-quality pairwise data in our database and train the actor agent. 4) We alternate this procedure for the actor and critic. See Figure \ref{fig:main_method_both_agents} of the supplement for the corresponding procedure applied to the critic.
  • Figure 2: Percent improvement in accuracy after five rounds of deliberation, compared to a single round. Percent improvement (Eq. \ref{eq:improve}) for each method is averaged across all five datasets.
  • Figure 3: Accuracy over five rounds of deliberation on BoolQ (top) and SCIQ (bottom).
  • Figure 4: Comparison of responses from the critic model before and after training with ACC-Collab.
  • Figure 5: ACC-Collab training pipeline, exemplified for the actor (top) and critic (bottom). 1) We generate data from both natural deliberation and guided deliberation, steered towards and away from the ground truth answer $y$, using the actor and critic. 2) We compute the relative quality of each trajectory based on the expected quality difference $\Delta_{y}, \Delta_{!y}$ w.r.t. the natural response. 3) We store all high-quality pairwise data in our database and train the actor model. 4) We alternate this procedure for the actor and critic. As outlined in Section \ref{sec:method}, both the guidance procedure and the computation of $\Delta_{y}, \Delta_{!y}$ differ between the actor and critic.
  • ...and 5 more figures
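
Steps 2 and 3 of the pipeline in the Figure 1 caption (scoring guided trajectories against the natural response and keeping only high-quality pairs) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Trajectories` fields, the sign conventions chosen for $\Delta_{y}$ and $\Delta_{!y}$, and the threshold `eps` are all assumptions made for the example, and the reward estimates are taken as given.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectories:
    """Responses at one deliberation step with estimated rewards (hypothetical fields)."""
    natural: str      # unguided response
    toward: str       # response guided toward the ground truth y
    away: str         # response guided away from y
    r_natural: float  # assumed estimate of how often deliberation ends on y
    r_toward: float
    r_away: float


def preference_pairs(t: Trajectories, eps: float = 0.1) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs whose quality gap is at least eps.

    eps is an assumed filtering threshold, not a value from the paper.
    """
    pairs: List[Tuple[str, str]] = []
    delta_y = t.r_toward - t.r_natural    # assumed: gain from guiding toward y
    delta_not_y = t.r_natural - t.r_away  # assumed: loss from guiding away from y
    if delta_y >= eps:
        pairs.append((t.toward, t.natural))
    if delta_not_y >= eps:
        pairs.append((t.natural, t.away))
    return pairs


if __name__ == "__main__":
    step = Trajectories("nat", "good", "bad", r_natural=0.5, r_toward=0.8, r_away=0.45)
    print(preference_pairs(step))
```

Pairs selected this way would then be passed to a preference-optimization trainer such as DPO, first for one agent and then for the other, mirroring the alternation in step 4.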

Theorems & Definitions (2)

  • Remark 1
  • Remark 2