Stay Focused: Problem Drift in Multi-Agent Debate

Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas, Bela Gipp

TL;DR

Stay Focused analyzes problem drift in multi-agent debate (MAD), where turn-based discussions among LLMs progressively stray from the initial task and degrade performance. The authors formalize drift with the per-turn metric $FOCUS_r = P(\hat{y}^{(r)}, y) - P(\hat{y}^{(r-1)}, y)$ and define recovery as the eventual regain of prior performance, using the cumulative metric $FOCUS_{1,M} = \sum_{r=1}^{M} FOCUS_r$. They propose DRIFTJudge for test-time drift detection and DRIFTPolicy (with a policy feedback agent) to mitigate drift, evaluating on ten tasks across generative, knowledge, reasoning, and instruction-following domains. Results show that drift is widespread, particularly in generative tasks (up to ~89%), that natural recovery is limited (often <50%), and that DRIFTPolicy can reduce drift and improve task accuracy by up to 3.6% for weaker model agents. The work provides eight human-annotated error types (temporal and local), releases DRIFTEval, and offers a first baseline for detecting and mitigating problem drift in MAD, with avenues for future refinement and efficiency improvements.
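The two FOCUS formulas above can be illustrated with a minimal Python sketch. It assumes that the per-turn performance values $P(\hat{y}^{(r)}, y)$ are already available as a list indexed from turn 0 to turn $M$; the function names, the binary example scores, and the rule "a negative $FOCUS_r$ marks a drifting turn" are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def focus_per_turn(scores: List[float]) -> List[float]:
    """Return FOCUS_r for r = 1..M.

    Assumption: scores[r] holds P(y_hat^(r), y), the task performance
    of the answer after turn r, with scores[0] as the starting point.
    """
    return [scores[r] - scores[r - 1] for r in range(1, len(scores))]


def cumulative_focus(scores: List[float]) -> float:
    """Return FOCUS_{1,M}, the sum of all per-turn FOCUS values."""
    return sum(focus_per_turn(scores))


def drifting_turns(scores: List[float]) -> List[int]:
    """Return the turns r whose FOCUS_r is negative.

    Assumption: a performance drop between two consecutive turns is
    treated here as a sign of drift; the paper may use another criterion.
    """
    return [r for r, f in enumerate(focus_per_turn(scores), start=1) if f < 0]


# Hypothetical debate: correct at turns 0 and 1, wrong at turns 2 and 3,
# correct again at turn 4 (binary correctness used as the performance P).
example_scores = [1.0, 1.0, 0.0, 0.0, 1.0]
per_turn = focus_per_turn(example_scores)
total = cumulative_focus(example_scores)
drops = drifting_turns(example_scores)
```

Because the sum telescopes, $FOCUS_{1,M}$ equals the last performance value minus the first. In the hypothetical example, performance drops at turn 2 and is regained at turn 4, so the cumulative value returns to zero.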

Abstract

Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations when solving complex problems that require longer reasoning chains. We analyze how multi-agent debate over multiple turns drifts away from the initial problem, thus harming task performance. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, eight human experts analyze 170 multi-agent discussions suffering from problem drift. We find the most common issues related to this drift are the lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To address problem drift, we propose DRIFTJudge, an LLM-as-a-judge method, to detect problem drift at test-time. We also propose DRIFTPolicy, a method that mitigates problem drift cases to improve task performance. Our study is a step toward understanding a key limitation of multi-agent debate, highlighting why longer debates can harm task performance and how problem drift could be addressed.

Paper Structure

This paper contains 44 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Problem drift in MAD. DRIFTJudge detects problem drift at test-time. DRIFTPolicy provides on-demand feedback about the conversation.
  • Figure 2: Example of problem drift in MAD. The English instructor induces a logical error in the discussion. The other agents agree without skepticism, leading to the wrong solution and problem drift.
  • Figure 3: Prompt to an agent that contributes to the discussion. If this is the first message of the discussion, we write "Nobody proposed a solution yet. Please provide the first one." instead of the most recent draft and agent memory.
  • Figure 4: Prompt to extract the solution from an agent's answer.
  • Figure 5: Prompt to process the voting at the end of each turn. The number of solutions to vote for can vary depending on the proposals made by the agents.
  • ...and 5 more figures