Stay Focused: Problem Drift in Multi-Agent Debate
Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas, Bela Gipp
TL;DR
Stay Focused analyzes problem drift in multi-agent debate (MAD), where turn-based discussions among LLMs progressively stray from the initial task and degrade performance. The authors formalize drift with the per-turn metric $FOCUS_r = P(\hat{y}^{(r)}, y) - P(\hat{y}^{(r-1)}, y)$, the change in performance $P$ between consecutive turns, and define recovery as the eventual regain of prior performance, measured with the cumulative metric $FOCUS_{1,M} = \sum_{r=1}^{M} FOCUS_r$. They propose DRIFTJudge for test-time drift detection and DRIFTPolicy (with a policy feedback agent) to mitigate drift, evaluating on ten tasks across generative, knowledge, reasoning, and instruction-following domains. Results show that drift is widespread, particularly in generative tasks (up to ~89%), that natural recovery is limited (often <50%), and that DRIFTPolicy can reduce drift and improve task accuracy by up to 3.6% for weaker model agents. The work provides eight human-annotated error types (temporal and local) and releases DRIFTEval, establishing a first baseline for detecting and mitigating problem drift in MAD and outlining avenues for future refinement and efficiency improvements.
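To make the two FOCUS formulas concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that per-turn performance scores $P(\hat{y}^{(r)}, y)$ are already available as a list indexed from a baseline turn 0 to turn $M$, and that a turn counts as drifting when its $FOCUS_r$ is negative. The function names and the `eps` tolerance are hypothetical and chosen only for illustration.

```python
from typing import List


def focus_per_turn(scores: List[float]) -> List[float]:
    """Return FOCUS_r for r = 1..M.

    Assumption: scores[r] holds P(y_hat^(r), y) for turns r = 0..M,
    so FOCUS_r = scores[r] - scores[r - 1].
    """
    return [scores[r] - scores[r - 1] for r in range(1, len(scores))]


def cumulative_focus(scores: List[float]) -> float:
    """Return FOCUS_{1,M}, the sum of the per-turn FOCUS values."""
    return sum(focus_per_turn(scores))


def drifting_turns(scores: List[float], eps: float = 0.0) -> List[int]:
    """Return the turn numbers r whose FOCUS_r falls below -eps.

    Assumption: a negative FOCUS_r is treated as a drift signal;
    eps is a hypothetical tolerance, not a value from the paper.
    """
    return [
        r
        for r, value in enumerate(focus_per_turn(scores), start=1)
        if value < -eps
    ]


# Hypothetical scores for a four-step discussion (turns 0..3).
example_scores = [0.5, 0.75, 0.25, 0.5]
per_turn = focus_per_turn(example_scores)    # [0.25, -0.5, 0.25]
total = cumulative_focus(example_scores)     # 0.0
drift = drifting_turns(example_scores)       # [2]
```

Because consecutive terms cancel, the sum telescopes to $P(\hat{y}^{(M)}, y) - P(\hat{y}^{(0)}, y)$. In the hypothetical example, performance drops at turn 2 and returns to its starting level by turn 3, so the cumulative value is zero.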
Abstract
Multi-agent debate, in which multiple instances of large language models discuss problems in turn-based interaction, has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations when solving complex problems that require longer reasoning chains. We analyze how multi-agent debate over multiple turns drifts away from the initial problem, thus harming task performance. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, eight human experts analyze 170 multi-agent discussions suffering from problem drift. We find that the most common issues related to this drift are a lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To address problem drift, we propose DRIFTJudge, an LLM-as-a-judge method, to detect problem drift at test-time. We also propose DRIFTPolicy, a method that mitigates problem drift cases to improve task performance. Our study is a step toward understanding a key limitation of multi-agent debate, highlighting why longer debates can harm task performance and how problem drift could be addressed.
