SEMAG: Self-Evolutionary Multi-Agent Code Generation

Yulin Peng; Haowen Hou; Xinxin Zhu; Ying Tiffany He; F. Richard Yu

Abstract

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.
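The adaptive workflow described above can be illustrated with a minimal sketch: later stages run only when the earlier ones fail the available tests. Everything here is an assumption made for illustration, not the SEMAG implementation. The function `solve_with_escalation`, the stage names, and the stub agents are hypothetical; real agents would call an LLM, and the planning stage is omitted for brevity.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a "solution" is modelled as a plain Python function,
# and an "agent" maps a task description to a candidate solution.
Solution = Callable[[int], int]
Agent = Callable[[str], Solution]
TestCase = Tuple[int, int]  # (input, expected output)


def passes(solution: Solution, tests: List[TestCase]) -> bool:
    """Return True when the candidate matches every expected output."""
    return all(solution(x) == expected for x, expected in tests)


def solve_with_escalation(
    task: str,
    stages: List[Tuple[str, Agent]],
    tests: List[TestCase],
) -> Tuple[Optional[Solution], List[str]]:
    """Try the stages in order and stop at the first passing candidate.

    Returns the accepted solution (or None) and the names of the stages
    that were actually run, so cheap tasks never reach the later stages.
    """
    trace: List[str] = []
    for name, agent in stages:
        trace.append(name)
        candidate = agent(task)
        if passes(candidate, tests):
            return candidate, trace
    return None, trace


# Stub agents standing in for LLM calls (assumed behaviour, for the demo only).
def coding_agent(task: str) -> Solution:
    return lambda x: x + 1  # deliberately wrong first draft


def debugging_agent(task: str) -> Solution:
    return lambda x: x * 2  # repaired version


def debating_agent(task: str) -> Solution:
    return lambda x: x * 2  # alternative proposal, reached only if debugging fails


STAGES = [
    ("coding", coding_agent),
    ("debugging", debugging_agent),
    ("debating", debating_agent),
]
TESTS = [(1, 2), (3, 6)]

solution, trace = solve_with_escalation("double the input", STAGES, TESTS)
print(trace)
```

In this toy run the first draft fails the second test case, so the loop escalates to the debugging stage and stops there; the debating stage is never invoked.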

Paper Structure (22 sections, 15 equations, 34 figures, 6 tables, 1 algorithm)

This paper contains 22 sections, 15 equations, 34 figures, 6 tables, 1 algorithm.

Introduction
Related Work
Traditional Approaches to Program Synthesis
Large Language Models for Code Synthesis
Prompting and Debugging Techniques
Method
Problem Formulation
Hierarchical Code Synthesis Framework
Adaptive Level Transition Mechanism
Self-Evolution Mechanism
Experiments
Experimental Setup
Main Results
Ablation Studies and Analyses
Conclusion
...and 7 more sections

Figures (34)

Figure 1: Workflow overview of the Self-Evolution Agents. Agents integrate insights from recent research, news, and community discussions, then dynamically identify and deploy the most suitable models.
Figure 2: Overview of SEMAG. (1) Self-Evolve: Agents dynamically select optimal backbone LLMs per task requirements. (2) Plan: The Planning Agent creates solution plans, which the Plan Verifying Agent validates through I/O simulation. (3) Debug: The Coding Agent generates code; upon failure, specialized agents (Embedding Trace, Code Explaining, Suggesting, Debugging) collaboratively refine it using trace logs. (4) Debate: When debugging stalls, Debating Agents propose alternatives and a Discriminating Agent selects the optimal configuration.
Figure 3: Pass@1 accuracy on CodeContests using GPT-4o (2024-05-13), GPT-4.1 (2025-04-14), DeepSeek-v3 (2025-03-24), and Claude-3.7-Sonnet (2025-02-19).
Figure 4: Comparison of Pass@1 accuracy and average token count per question for LPW and SEMAG across benchmarks, using GPT-4o as the LLM backbone. Here, $K=10^3$.
Figure 5: Pass@1 accuracy (right y-axis) and its variance (left y-axis, scaled by $\times 10^{-4}$) on the HumanEval benchmark using GPT-3.5 as the backbone, measured over three independent runs for each temperature setting (0.1 to 1.0).
...and 29 more figures
