
MapCoder-Lite: Distilling Multi-Agent Coding into a Single Small LLM

Woongkyu Lee, Junhee Cho, Jungwook Choi

TL;DR

The paper tackles multi-agent code generation for competitive programming by enabling a single small language model to emulate a full multi-agent system. It introduces MapCoder-Lite, a 7B backbone enhanced with pass-based trajectory distillation, supervisor-guided refinement, and agent-wise LoRA adapters, which together achieve high accuracy with reduced resource demands. Empirical results on xCodeEval, APPS, and CodeContests show substantial gains over baselines, the elimination of format failures, and strong efficiency benefits relative to 32B models, with notable generalization to other backbones. The work demonstrates that targeted, role-aligned fine-tuning can give compact models robust multi-agent coding capabilities suited to resource-constrained, near on-device deployment, making AI-assisted programming more accessible.

Abstract

Large language models (LLMs) have advanced code generation from single-function tasks to competitive-programming problems, but existing multi-agent solutions either rely on costly large-scale (>30B) models or collapse when downsized to small open-source models. We present MapCoder-Lite, a framework for distilling the complex reasoning of large, multi-agent coding systems into a single 7B model. Our contribution is a novel, three-pillar methodology that synergistically generates, refines, and encodes multi-agent knowledge: (i) pass-based trajectory distillation from strong LLMs fixes format fragility in retrieval and reduces failures in debugging, (ii) supervisor-guided correction with global feedback strengthens planning and coding agents, and (iii) agent-wise LoRA fine-tuning delivers memory-efficient specialisation. Comprehensive evaluation on xCodeEval, APPS, and CodeContests shows that MapCoder-Lite more than doubles xCodeEval accuracy (from 13.2% to 28.3%), eliminates all format failures, and reduces GPU memory and token-generation time by 4x compared with a 32B model. It also achieves over 10% gains on simpler coding benchmarks, showing that the improvements extend beyond competitive programming. These results demonstrate that careful agent-wise fine-tuning unlocks high-quality multi-agent coding on a small language model. Our code is publicly available at https://github.com/aiha-lab/MapCoder-Lite.
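Pillar (iii) keeps one frozen backbone and gives each agent its own low-rank adapter, so only the small adapters differ between roles. The pure-Python sketch below illustrates the standard LoRA update, y = W x + (alpha / r) · B (A x), with one (A, B) pair per agent. The class name, the toy matrix sizes, `alpha`, and `rank` are illustrative assumptions, not the paper's configuration or code.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class AgentWiseLoRALinear:
    """Hypothetical illustration: one frozen weight W shared by all agents,
    plus a per-agent low-rank delta (alpha / rank) * B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """

    def __init__(self, base_weight: Matrix, alpha: float, rank: int) -> None:
        self.base_weight = base_weight  # frozen and stored only once
        self.scale = alpha / rank
        self.rank = rank
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def add_adapter(self, agent: str, a: Matrix, b: Matrix) -> None:
        if len(a) != self.rank or any(len(row) != self.rank for row in b):
            raise ValueError("adapter shapes do not match the configured rank")
        self.adapters[agent] = (a, b)

    def forward(self, agent: str, x: Vector) -> Vector:
        base = matvec(self.base_weight, x)
        if agent not in self.adapters:
            return base  # no adapter registered: behave like the plain backbone
        a, b = self.adapters[agent]
        delta = matvec(b, matvec(a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def extra_params(self) -> int:
        """Parameters added by all adapters; each costs rank * (d_in + d_out)."""
        return sum(len(a) * len(a[0]) + len(b) * len(b[0])
                   for a, b in self.adapters.values())


# Toy usage: a 2x2 identity backbone and a rank-1 adapter for the coding agent.
layer = AgentWiseLoRALinear(base_weight=[[1.0, 0.0], [0.0, 1.0]], alpha=2.0, rank=1)
layer.add_adapter("coding", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
print(layer.forward("coding", [1.0, 2.0]))    # [4.0, 2.0]
print(layer.forward("planning", [1.0, 2.0]))  # [1.0, 2.0]
print(layer.extra_params())                   # 4
```

The design point is that switching roles only swaps the small (A, B) pair, which is why per-agent specialisation costs far less memory than hosting one full model per agent.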

Paper Structure

This paper contains 34 sections, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Overview of the MapCoder system. Given a natural language problem, the retrieval agent fetches relevant algorithmic knowledge, followed by the planning agent generating a solution plan. The coding agent implements the plan, and the debugging agent iteratively refines the code based on test outcomes.
  • Figure 2: Representative failure cases of the 7B-scale model across all agents. (a) Retrieval: invalid XML format and incorrect algorithm. (b) Planning: missing key step. (c) Coding: misinterpreted input specification. (d) Debugging: persistent unresolved error. Detailed illustrations are provided in the Appendix.
  • Figure 3: Illustration of trajectory construction for retrieval and debugging datasets. (a) When a strong LLM generates a complete solution that passes unit tests, the retrieval agent’s input-output pair is extracted as a training example. (b) To collect debugging data, a 7B model is used for planning and coding, and the strong LLM is used for debugging when the initial code fails. If the revised output passes, the debugging trajectory is added to the dataset.
  • Figure 4: Cosine similarity between ground-truth algorithm tags and the algorithm descriptions generated by the base and fine-tuned models for 50 xCodeEval problems. Darker cells indicate stronger alignment.
  • Figure 5: Supervisor-aided data collection pipeline. When the final output fails, the supervisor inspects the full trajectory (including algorithm, plan, code, and test result), identifies the responsible agent, and provides targeted feedback to revise its output. If the revised result passes, the updated trajectory is added to the fine-tuning dataset.
  • ...and 7 more figures
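The retrieve, plan, code, and debug flow described for Figure 1 can be sketched as a small driver loop. The `Agents` container, the `run_tests` helper, and the `max_debug_rounds` budget are hypothetical stand-ins (in MapCoder-Lite each role would be served by the shared 7B backbone with its own adapter); the sketch only shows the control flow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agents:
    """Hypothetical per-role callables that map text inputs to a text output."""
    retrieve: Callable[[str], str]               # problem -> algorithm notes
    plan: Callable[[str, str], str]              # problem, notes -> plan
    code: Callable[[str, str], str]              # problem, plan -> code
    debug: Callable[[str, str, List[str]], str]  # problem, code, failures -> code


def solve(problem: str, agents: Agents,
          run_tests: Callable[[str], List[str]],
          max_debug_rounds: int = 3) -> str:
    """Retrieval -> planning -> coding, then debug while tests fail.

    `run_tests` returns failure messages; an empty list means all tests pass.
    The code from the final debugging round is returned without re-testing.
    """
    notes = agents.retrieve(problem)
    plan = agents.plan(problem, notes)
    code = agents.code(problem, plan)
    for _ in range(max_debug_rounds):
        failures = run_tests(code)
        if not failures:
            break
        code = agents.debug(problem, code, failures)
    return code


# Toy usage: the first draft is wrong and one debugging round fixes it.
toy = Agents(
    retrieve=lambda p: "simple arithmetic",
    plan=lambda p, n: "read a and b, print their sum",
    code=lambda p, pl: "print(a - b)",
    debug=lambda p, c, f: c.replace("-", "+"),
)
result = solve("add two numbers", toy,
               run_tests=lambda c: [] if "+" in c else ["wrong answer"])
print(result)  # print(a + b)
```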
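The pass-based rule behind Figure 3 keeps a trajectory as a training example only when the code it leads to passes the unit tests. In the sketch below, `strong_solve`, `small_code`, `strong_debug`, and `passes` are hypothetical placeholders for the strong LLM, the 7B planner and coder, and a unit-test harness; the paper's actual prompts and data formats are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    agent: str   # which agent this example fine-tunes
    prompt: str
    target: str


def collect_pass_based(
    problems: List[str],
    strong_solve: Callable[[str], Tuple[str, str, str]],  # -> (retrieval prompt, retrieval output, final code)
    small_code: Callable[[str], str],                     # -> code from the 7B planner + coder
    strong_debug: Callable[[str, str], Tuple[str, str]],  # (problem, code) -> (debug prompt, revised code)
    passes: Callable[[str, str], bool],                   # (problem, code) -> all unit tests pass?
) -> List[Example]:
    dataset: List[Example] = []
    for problem in problems:
        # (a) Retrieval data: keep the strong model's retrieval I/O only if its full solution passes.
        r_prompt, r_out, code = strong_solve(problem)
        if passes(problem, code):
            dataset.append(Example("retrieval", r_prompt, r_out))
        # (b) Debugging data: the strong model repairs the small model's failing code,
        #     and the trajectory is kept only if the revision passes.
        draft = small_code(problem)
        if not passes(problem, draft):
            d_prompt, revised = strong_debug(problem, draft)
            if passes(problem, revised):
                dataset.append(Example("debugging", d_prompt, revised))
    return dataset


# Toy usage with string stand-ins for code; "good" is the only passing program.
data = collect_pass_based(
    ["p1", "p2"],
    strong_solve=lambda p: ("R:" + p, "dp", "good" if p == "p1" else "bad"),
    small_code=lambda p: "bad",
    strong_debug=lambda p, c: ("D:" + p, "good" if p == "p2" else "bad"),
    passes=lambda p, c: c == "good",
)
print([(e.agent, e.prompt) for e in data])  # [('retrieval', 'R:p1'), ('debugging', 'D:p2')]
```

Filtering on test outcomes rather than on the text itself means malformed or wrong outputs never enter the dataset, which matches the paper's claim that this step removes format fragility in retrieval.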
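The alignment score in Figure 4 is a cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖), computed between vector representations of each generated description and each ground-truth tag. The embedding model is not specified here, so the vectors below are made-up toy values, and `similarity_matrix` is a hypothetical helper that produces one heatmap row per description.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """(u . v) / (||u|| * ||v||); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(descriptions: List[Vector], tags: List[Vector]) -> List[List[float]]:
    """Rows are generated descriptions, columns are ground-truth tags."""
    return [[cosine_similarity(d, t) for t in tags] for d in descriptions]


print(similarity_matrix([[1.0, 0.0]], [[1.0, 0.0], [0.0, 2.0]]))  # [[1.0, 0.0]]
```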
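The supervisor-aided step in Figure 5 can be sketched as a single correction pass over a failing trajectory. Here `passes`, `supervisor`, and `revise` are hypothetical callables: the supervisor reads the whole trajectory and names the responsible agent plus feedback, and `revise` is assumed to regenerate that agent's output (and anything downstream of it) before re-running the tests.

```python
from typing import Callable, Dict, Optional, Tuple

Trajectory = Dict[str, str]  # e.g. {"algorithm": ..., "plan": ..., "code": ..., "test_result": ...}


def supervisor_correct(
    traj: Trajectory,
    passes: Callable[[Trajectory], bool],
    supervisor: Callable[[Trajectory], Tuple[str, str]],
    revise: Callable[[str, Trajectory, str], Trajectory],
) -> Optional[Trajectory]:
    """Return a corrected trajectory to add to the fine-tuning set, or None.

    supervisor(traj) -> (responsible_agent, feedback)
    revise(agent, traj, feedback) -> updated trajectory with a fresh test result
    """
    if passes(traj):
        return None  # only failing trajectories are sent to the supervisor
    agent, feedback = supervisor(traj)
    revised = revise(agent, traj, feedback)
    return revised if passes(revised) else None


# Toy usage: the supervisor blames the plan, and the revised trajectory passes.
traj = {"algorithm": "greedy", "plan": "sort then pick", "code": "draft", "test_result": "WA"}
fixed = supervisor_correct(
    traj,
    passes=lambda t: t["test_result"] == "OK",
    supervisor=lambda t: ("planning", "the plan misses the tie-breaking step"),
    revise=lambda agent, t, fb: {**t, "plan": t["plan"] + " (break ties)",
                                 "code": "fixed", "test_result": "OK"},
)
print(fixed["plan"])  # sort then pick (break ties)
```

Because the supervisor sees the full trajectory rather than one agent's output, its feedback can target an upstream error (such as a flawed plan) that only shows up as a failing test at the end.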