
Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research

Martin Legrand, Tao Jiang, Matthieu Feraud, Benjamin Navet, Yousouf Taghzouti, Fabien Gandon, Elise Dumont, Louis-Félix Nothias

Abstract

Current Autonomous Scientific Research (ASR) systems, despite leveraging large language models (LLMs) and agentic architectures, remain constrained by fixed workflows and toolsets that prevent adaptation to evolving tasks and environments. We introduce Mimosa, an evolving multi-agent framework that automatically synthesizes task-specific multi-agent workflows and iteratively refines them through experimental feedback. Mimosa leverages the Model Context Protocol (MCP) for dynamic tool discovery, generates workflow topologies via a meta-orchestrator, executes subtasks through code-generating agents that invoke available tools and scientific software libraries, and scores executions with an LLM-based judge whose feedback drives workflow refinement. On ScienceAgentBench, Mimosa achieves a success rate of 43.1% with DeepSeek-V3.2, surpassing both single-agent baselines and static multi-agent configurations. Our results further reveal that models respond heterogeneously to multi-agent decomposition and iterative learning, indicating that the benefits of workflow evolution depend on the capabilities of the underlying execution model. Beyond these benchmark results, Mimosa's modular architecture and tool-agnostic design make it readily extensible, and its fully logged execution traces and archived workflows support auditability by preserving every analytical step for inspection and potential replication. Combined with domain-expert guidance, the framework has the potential to automate a broad range of computationally accessible scientific tasks across disciplines. Released as a fully open-source platform, Mimosa aims to provide an open foundation for community-driven ASR.


Paper Structure

This paper contains 61 sections, 7 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The Mimosa Framework. The system operates through five sequential layers: (0) Divide into sub-tasks --- the planner divides the user objective into sub-tasks; (1) Discover Tools --- available MCP servers are scanned and enumerated via Toolomics; (2) Initialize Workflow --- the archive is queried for a similar prior task, and the best matching workflow is retrieved or synthesized from scratch; (3) Execute Workflow --- specialized agents execute the task using discovered relevant tools; (4) Evaluate Workflow --- a judge scores the execution and the meta-orchestrator mutates the workflow for the next iteration; the evaluated workflow is archived, and the loop repeats until the judge score exceeds 0.9 or the predefined number of iterations is reached.
  • Figure 2: Iterative workflow refinement via single-incumbent search. (1) Mutate Incumbent Workflow --- at each iteration, the meta-orchestrator takes the best-performing workflow observed so far (the incumbent) together with its evaluation feedback and proposes a single local modification to generate a mutated workflow. (2) Execute Workflow --- each node in the workflow is executed by a SmolAgent CodeAgent instance. (3) Evaluate Workflow --- the judge evaluates the resulting execution trace across goal alignment, collaboration efficiency, output quality, and answer plausibility, and returns structured feedback with an overall score. (4) Archive and Select Incumbent --- the evaluated workflow is archived, and the highest-scoring workflow observed so far is retained as the incumbent for the next iteration. If the new workflow does not improve the score, the previous incumbent is kept. The cycle repeats for a predefined number of iterations or until the judge score exceeds 0.9. A minimal code sketch of this loop is given after this figure list.
  • Figure 3: Reward gains from successive evolution iterations. Average change in reward relative to the previous iteration with SEM error bars. Data is pooled across runs from all evaluated models (GPT-4o, DeepSeek-V3.2, Claude Haiku 4.5); per-model breakdowns are discussed in the text.
  • Figure 4: Statistical validation of evolution efficacy. (a) Mean gain by cohort with 95% bootstrap confidence intervals. Cohorts 2–9 show significant positive gains (green bars); cohort 10 does not. (b) Probability of improvement (green) versus regression (red) at each transition. Early transitions show improvement rates >50%, while transitions 8→9 and 9→10 favor regression. (c) Permutation test distribution vs observed mean gain (red line). The observed gain significantly exceeds the null distribution (p < 0.0001), confirming improvement is not due to random chance. (d) Effect size (Cohen's d) by cohort. Cohorts 2–9 achieve large effect sizes (d > 0.8), with cohort 4 showing the strongest effect (d = 2.75). All statistical tests are computed on pooled data across models; per-model analyses are deferred to a future revision.
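
To make the refinement loop in the Figure 2 caption concrete, the following is a minimal Python sketch of single-incumbent search. The names (Candidate, mutate, execute, judge) and the workflow representation are illustrative placeholders rather than Mimosa's actual API; only the 0.9 score threshold, the iteration budget, and the mutate/execute/evaluate/archive steps are taken from the caption.

```python
# Minimal sketch of single-incumbent workflow refinement (Figure 2).
# All function and class names are hypothetical placeholders, not Mimosa's API.
from dataclasses import dataclass

SCORE_THRESHOLD = 0.9   # loop stops once the judge score exceeds this value
MAX_ITERATIONS = 10     # predefined iteration budget

@dataclass
class Candidate:
    workflow: dict            # workflow topology (nodes = agents, edges = data flow)
    score: float = 0.0        # overall judge score in [0, 1]
    feedback: str = ""        # structured feedback returned by the judge

def evolve(initial_workflow, mutate, execute, judge):
    """Keep the best workflow seen so far and repeatedly propose one local
    mutation of it, guided by the judge's feedback."""
    archive = []

    # Evaluate the initial workflow (retrieved from the archive or synthesized).
    incumbent = Candidate(workflow=initial_workflow)
    trace = execute(incumbent.workflow)                  # each node runs as a code agent
    incumbent.score, incumbent.feedback = judge(trace)   # goal alignment, collaboration,
                                                         # output quality, plausibility
    archive.append(incumbent)

    for _ in range(MAX_ITERATIONS):
        if incumbent.score > SCORE_THRESHOLD:
            break

        # (1) Mutate the incumbent using its evaluation feedback.
        mutated = Candidate(workflow=mutate(incumbent.workflow, incumbent.feedback))

        # (2) Execute the mutated workflow and (3) evaluate the execution trace.
        trace = execute(mutated.workflow)
        mutated.score, mutated.feedback = judge(trace)

        # (4) Archive the evaluated workflow; keep the higher-scoring one as incumbent.
        archive.append(mutated)
        if mutated.score > incumbent.score:
            incumbent = mutated

    return incumbent, archive
```

Because the incumbent is replaced only when a mutation improves the judge score, the best-so-far score is non-decreasing even when individual mutations regress, which is consistent with the per-transition regressions reported in Figure 4(b) coexisting with positive pooled gains.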