MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

Siddeshwar Raghavan, Tanwi Mallick

TL;DR

MOSAIC introduces a training-free, LLM-agnostic multi-agent framework for scientific code generation that decomposes complex problems into chained subproblems using four specialized agents (Self-Reflection, Rationale, Coding, Debugger) and a teacher-guided knowledge-distillation workflow. A Consolidated Context Window and domain-specific memories mitigate hallucinations and cross-domain interference, enabling robust, executable code without validation I/O. Across SciCode and general coding benchmarks, MOSAIC achieves higher problem-solving accuracy and better precision than strong baselines, with ablations showing the value of orchestrating distinct expert roles. This work offers a scalable, interpretable approach to complex scientific programming, with potential for heterogeneous backbones and reinforcement learning from execution feedback to further enhance performance and reliability.

Abstract

We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, grounded in deep domain knowledge, and informed by domain-specific reasoning, and these algorithms must be iterated on without I/O test cases. Many scientific problems also require solving a sequence of subproblems that leads to the final desired result. MOSAIC is a training-free framework whose specialized agents self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition and targeted error correction and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.
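The abstract names four agent roles and a Consolidated Context Window but does not give their interfaces. As a rough illustration only, the sketch below shows one way such a loop could be wired in Python. Everything in it is an assumption made for illustration: the `Agent` callable type, the `ConsolidatedContext` class (modeled here as a bounded list of short notes), the prompt strings, and the `solve_chain` and `try_execute` helpers are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interface: an agent is any function mapping a prompt to a reply.
# In practice this would wrap an LLM call; here it is left abstract.
Agent = Callable[[str], str]


@dataclass
class ConsolidatedContext:
    """Assumed CCW behavior: one short note per solved step, bounded in size."""
    max_entries: int = 5
    entries: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.entries.append(note)
        self.entries = self.entries[-self.max_entries:]  # drop the oldest notes

    def render(self) -> str:
        return "\n".join(self.entries)


def try_execute(code: str) -> Optional[str]:
    """Run code in a fresh namespace; return an error string, or None if it ran.

    Only "does it execute" is checked, so no I/O test cases are needed.
    A real system should sandbox this call.
    """
    try:
        exec(code, {})
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def solve_chain(
    subproblems: List[str],
    validation_code: str,
    self_reflection_agent: Agent,
    rationale_agent: Agent,
    coding_agent: Agent,
    debugger_agent: Agent,
    max_debug_rounds: int = 2,
) -> List[str]:
    # Teacher side (assumed): turn validation-set code into few-shot pseudocode.
    few_shot = self_reflection_agent(validation_code)
    ccw = ConsolidatedContext()
    solutions: List[str] = []
    for step in subproblems:
        context = ccw.render()
        rationale = rationale_agent(
            f"Examples:\n{few_shot}\n\nContext:\n{context}\n\nTask:\n{step}"
        )
        code = coding_agent(f"Plan:\n{rationale}\n\nContext:\n{context}")
        for _ in range(max_debug_rounds):
            # Earlier solutions are prepended so chained steps can call them.
            error = try_execute("\n\n".join(solutions + [code]))
            if error is None:
                break
            code = debugger_agent(f"Error:\n{error}\n\nCode:\n{code}")
        solutions.append(code)
        ccw.add(f"Solved: {step}")  # short note only, not the full transcript
    return solutions
```

Passing plain functions as agents keeps the sketch independent of any particular LLM backbone, which matches the LLM-agnostic framing in the TL;DR; the debugging loop is driven purely by execution errors rather than by expected outputs.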

Paper Structure

This paper contains 26 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of MOSAIC, a four-agent framework with domain independent memory. The design is inspired by knowledge distillation, where the Teacher module leverages few-shot examples from domain validation data to guide the Student module. This process enables the generation of clean rationales, which are then converted into accurate and executable code. The consolidated context window helps the agents focus on the current problem without being overwhelmed by previously generated information. (NOTE: When available, sample test cases from the dataset can also be incorporated by the Coding Agent.)
  • Figure 2: Structure of problems and subproblems in the SciCode dataset. Each main problem is composed of multiple subproblems, all of which must be solved correctly for the main problem to be considered successfully solved.
  • Figure 3: Error statistics on SciCode benchmarks. The figure distinguishes syntactic errors (failed execution) from semantic errors (output–target mismatches). Semantic errors are shown in gray, while other colors represent different categories of syntactic errors. MOSAIC substantially reduces both the overall error rate and the relative proportion of syntactic errors compared to the baseline.
  • Figure 4: Precision differences between target and generated outputs. Compared to the baseline, MOSAIC produces a larger proportion of executable code, which introduces slightly more detectable errors. However, MOSAIC outputs exhibit substantially smaller deviations from the target values, indicating improved numerical precision.
  • Figure 5: The prompt template for our Self-Reflection Agent, which analyzes the ground-truth code from the validation set to understand patterns in each domain and prepares gold-standard, domain-specific pseudocode for the Rationale Agent to use as few-shot examples.
  • ...and 4 more figures
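
The captions for Figures 2 and 3 imply two simple rules: a main problem counts as solved only when every one of its subproblems is solved, and a failure is syntactic if the code does not execute and semantic if it executes but its output does not match the target. A minimal Python sketch of those rules follows. The function names, the dictionary layout of `results`, and the relative tolerance used for the numeric comparison are assumptions for illustration, not taken from the paper or the SciCode harness.

```python
import math
from typing import Dict, List, Optional, Tuple


def main_problem_solved(subproblem_passed: List[bool]) -> bool:
    """A main problem is solved only if all of its subproblems passed."""
    return len(subproblem_passed) > 0 and all(subproblem_passed)


def solve_rates(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (main-problem rate, subproblem rate) for per-problem pass lists."""
    total_sub = sum(len(v) for v in results.values())
    if not results or total_sub == 0:
        raise ValueError("results must contain at least one subproblem")
    sub_rate = sum(sum(v) for v in results.values()) / total_sub
    main_rate = sum(main_problem_solved(v) for v in results.values()) / len(results)
    return main_rate, sub_rate


def classify(executed: bool, output: Optional[float], target: float,
             rel_tol: float = 1e-9) -> str:
    """Label one scalar result; rel_tol is an assumed tolerance."""
    if not executed:
        return "syntactic"  # failed execution
    if output is not None and math.isclose(output, target, rel_tol=rel_tol):
        return "correct"
    return "semantic"  # ran, but output and target disagree


# Hypothetical data: p1 passes all three steps, p2 fails its second step.
results = {"p1": [True, True, True], "p2": [True, False]}
main_rate, sub_rate = solve_rates(results)
print(main_rate, sub_rate)  # 0.5 0.8
```

The example shows why main-problem accuracy is the stricter metric: four of five subproblems pass, yet only one of two main problems counts as solved.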