Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration
Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao
TL;DR
The paper formulates automated scientific research as a bilevel optimization problem: \( \max_{p \in \mathcal{P}} R(p, y^*(p)) \) subject to \( y^*(p) \in \arg\max_{y \in \mathcal{Y}(p)} f(p,y) \), where \( p \) is a research plan and \( y \) is its execution. It introduces the Double-Loop Multi-Agent (DLMA) framework, in which a leader loop of professor agents evolves a population of plans through involvement, improvement, and integration meetings, and a follower loop of doctoral student agents executes the chosen plan using pre-hoc and post-hoc planning, contextual and external observations, and continual draft refinement. Experiments on ACLAward and Laboratory show state-of-the-art automatic evaluation scores, and ablation studies confirm that both loops are essential: evolution drives novelty while execution ensures soundness. DLMA advances automated scientific discovery by integrating literature review, experimentation, and drafting, but it incurs significant computational cost and remains prone to problems such as code-generation hallucinations, which motivates future work on efficiency and reliability.
Abstract
Automating the end-to-end scientific research process poses a fundamental challenge: it requires both evolving high-level plans that are novel and sound, and executing these plans correctly amidst dynamic and uncertain conditions. To address this bilevel challenge, we propose a novel Double-Loop Multi-Agent (DLMA) framework that automatically solves a given research problem. The leader loop, composed of professor agents, is responsible for evolving research plans. It employs an evolutionary algorithm, organized as involvement, improvement, and integration meetings, to iteratively generate and refine a pool of research proposals and explore the solution space effectively. The follower loop, composed of doctoral student agents, is responsible for executing the best evolved plan. It dynamically adjusts the plan during implementation via pre-hoc and post-hoc meetings, ensuring that each step (e.g., drafting, coding) is well supported by contextual and external observations. Extensive experiments on benchmarks such as ACLAward and Laboratory show that DLMA generates research papers that achieve state-of-the-art scores in automated evaluation, significantly outperforming strong baselines. Ablation studies confirm the critical roles of both loops, with evolution driving novelty and execution ensuring soundness.
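The double-loop structure described above can be illustrated with a toy numerical sketch. This is not the authors' implementation: plans and executions are plain floats rather than LLM-generated proposals and drafts, the objectives `toy_f` and `toy_R` are invented for the example, and mapping the involvement, improvement, and integration meetings to "add a newcomer", "perturb a plan", and "average two plans" is an assumption made purely for illustration. All function names are hypothetical.

```python
import random


def toy_f(plan, execution):
    """Hypothetical follower objective f(p, y): the execution should track the plan."""
    return -(execution - plan) ** 2


def toy_R(plan, execution):
    """Hypothetical leader objective R(p, y): plans near 3.0 that are executed faithfully."""
    return -(plan - 3.0) ** 2 - (execution - plan) ** 2


def follower_execute(plan, steps=50):
    """Follower loop: approximate y*(p) = argmax_y f(p, y) by iterative refinement."""
    y, step = 0.0, 0.5
    for _ in range(steps):
        # Propose adjustments around the current execution, then keep the best one.
        candidates = [y - step, y, y + step]
        best = max(candidates, key=lambda c: toy_f(plan, c))
        if best == y:
            step /= 2  # no coarse adjustment helps, so refine more finely
        y = best
    return y


def double_loop(generations=100, pool_size=6, seed=0):
    """Leader loop: evolve a pool of plans, scoring each by R(p, y*(p))."""
    rng = random.Random(seed)
    pool = [rng.uniform(-10.0, 10.0) for _ in range(pool_size)]

    def score(plan):
        # Every plan is judged through the follower's execution of it.
        return toy_R(plan, follower_execute(plan))

    for _ in range(generations):
        newcomer = rng.uniform(-10.0, 10.0)                  # "involvement" (assumed)
        improved = [p + rng.gauss(0.0, 0.5) for p in pool]   # "improvement" (assumed)
        a, b = rng.sample(pool, 2)
        merged = (a + b) / 2.0                               # "integration" (assumed)
        candidates = pool + improved + [newcomer, merged]
        # Elitist selection: keep the highest-scoring plans for the next round.
        pool = sorted(candidates, key=score, reverse=True)[:pool_size]

    best_plan = pool[0]
    return best_plan, follower_execute(best_plan)


if __name__ == "__main__":
    plan, execution = double_loop()
    print(f"best plan: {plan:.3f}, execution: {execution:.3f}")
```

The point of the sketch is the nesting: the leader never scores a plan directly, only through the execution the follower produces for it, which is what makes the problem bilevel rather than a single joint search over plans and executions.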
