SEMAG: Self-Evolutionary Multi-Agent Code Generation

Yulin Peng, Haowen Hou, Xinxin Zhu, Ying Tiffany He, F. Richard Yu

Abstract

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.
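The staged workflow described in the abstract can be pictured as a small control loop. The sketch below is only one possible reading, not the authors' implementation: the names `Agents`, `Result`, `solve`, `max_debug_rounds`, and `max_debates` are hypothetical, each agent is reduced to a plain callable, and "adapting to task difficulty" is modelled simply as exiting early when the code passes and escalating to debugging and then discussion when it does not.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agents:
    """Hypothetical stand-ins for the stages named in the abstract."""
    plan: Callable[[str], str]          # task -> solution plan
    code: Callable[[str, str], str]     # (task, plan) -> candidate code
    passes: Callable[[str], bool]       # candidate code -> do the tests pass?
    debug: Callable[[str, str], str]    # (task, failing code) -> revised code
    debate: Callable[[str, str], str]   # (task, failed plan) -> alternative plan


@dataclass
class Result:
    code: str
    solved: bool
    stages: List[str] = field(default_factory=list)  # stages visited, in order


def solve(task: str, agents: Agents,
          max_debug_rounds: int = 2, max_debates: int = 1) -> Result:
    """Run plan -> code -> debug -> debate, stopping as soon as tests pass."""
    stages = ["plan"]
    plan = agents.plan(task)
    code = ""
    for attempt in range(max_debates + 1):
        stages.append("code")
        code = agents.code(task, plan)
        if agents.passes(code):
            return Result(code, True, stages)   # easy task: no escalation
        for _ in range(max_debug_rounds):
            stages.append("debug")
            code = agents.debug(task, code)
            if agents.passes(code):
                return Result(code, True, stages)
        if attempt < max_debates:               # debugging stalled: rethink the plan
            stages.append("debate")
            plan = agents.debate(task, plan)
    return Result(code, False, stages)
```

Under these assumptions, an easy task visits only the plan and code stages, while a harder one accumulates debug and debate entries in `Result.stages`, which makes the per-task workflow length directly observable.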

Paper Structure (22 sections, 15 equations, 34 figures, 6 tables, 1 algorithm)

Figures (34)

  • Figure 1: Overview workflow of Self-Evolution Agents. Agents integrate insights from recent research, news, and community discussions to dynamically identify and deploy the most suitable models.
  • Figure 2: Overview of SEMAG. (1) Self-Evolve: Agents dynamically select optimal backbone LLMs per task requirements. (2) Plan: The Planning Agent creates solution plans, which the Plan Verifying Agent validates through I/O simulation. (3) Debug: The Coding Agent generates code; upon failure, specialized agents (Embedding Trace, Code Explaining, Suggesting, Debugging) collaboratively refine it using trace logs. (4) Debate: When debugging stalls, Debating Agents propose alternatives and the Discriminating Agent selects the optimal configuration.
  • Figure 3: Pass@1 accuracy on CodeContests using GPT-4o (2024-05-13), GPT-4.1 (2025-04-14), DeepSeek-v3 (2025-03-24), and Claude-3.7-Sonnet (2025-02-19).
  • Figure 4: Comparison of Pass@1 accuracy and average token count per question for LPW and SEMAG across benchmarks, using GPT-4o as the LLM backbone. Here, $K=10^3$.
  • Figure 5: Pass@1 accuracy (right y-axis) and its variance (left y-axis, scaled by $\times 10^{-4}$) on the HumanEval benchmark using GPT-3.5 as the backbone, measured over three independent runs for each temperature setting (0.1 to 1.0).
  • ...and 29 more figures
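
The quantities named in the captions for Figures 4 and 5 (Pass@1, its variance across independent runs, and average token count in units of $K=10^3$) can be computed as in the following minimal sketch. The function names are hypothetical, Pass@1 is taken as the fraction of problems solved with a single sample per problem, and population variance is assumed because the caption does not say which estimator is used.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def pass_at_1(solved: Sequence[bool]) -> float:
    """Fraction of problems whose single generated solution passes all tests."""
    return sum(solved) / len(solved)


def summarize_runs(runs: List[Sequence[bool]]) -> Tuple[float, float]:
    """Mean and variance of Pass@1 over independent runs.

    Assumption: population variance (divide by n); the source caption
    does not specify the estimator.
    """
    scores = [pass_at_1(run) for run in runs]
    return mean(scores), pvariance(scores)


def avg_tokens_in_k(token_counts: Sequence[int]) -> float:
    """Average tokens per question, expressed in units of K = 10^3."""
    return mean(token_counts) / 1_000
```

For instance, calling `summarize_runs` on three lists of per-problem pass/fail flags (one list per run) returns the pair that a plot like Figure 5 would show for a single temperature setting; any inputs used to try it out are toy data, not results from the paper.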