Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, Yuanping Guo

TL;DR

ProjectGen tackles the challenge of generating complete software projects directly from user requirements. It introduces SSAT, a Semantic Software Architecture Tree, and a three-phase, multi-agent framework (architecture design, skeleton generation, code filling) with memory-based iterative refinement to curb error propagation. The CodeProjectEval dataset provides real-world-scale projects with executable tests for automated evaluation, complementing DevBench for small-scale assessment. Results demonstrate state-of-the-art performance on both small-scale and larger-scale tasks, with ablations confirming the value of SSAT and iterative optimization for reliable end-to-end project synthesis.

Abstract

In recent years, Large Language Models (LLMs) have achieved remarkable progress in automated code generation. In real-world software engineering, the growing demand for rapid iteration and continuous delivery underscores the importance of project-level code generation, where LLMs are expected to generate complete software projects directly from complex user requirements. Although existing studies have made initial explorations, they still face key limitations, including unrealistic datasets and unreliable evaluation metrics that fail to reflect real-world complexity, the semantic gap between human-written requirements and machine-interpretable structures, and difficulties in managing hierarchical dependencies and maintaining quality throughout the generation process. To address these limitations, we first introduce CodeProjectEval, a project-level code generation dataset built from 18 real-world repositories with 12.7 files and 2,388.6 lines of code per task on average, supplemented with documentation and executable test cases for automatic evaluation. We further propose ProjectGen, a multi-agent framework that decomposes projects into architecture design, skeleton generation, and code filling stages with iterative refinement and memory-based context management. Within this framework, we introduce the Semantic Software Architecture Tree (SSAT), a structured and semantically rich representation that effectively bridges user requirements and source code implementation. Experiments show that ProjectGen achieves state-of-the-art performance, passing 52/124 test cases on the small-scale project-level code generation dataset DevBench, a 57% improvement over the baseline approaches, and 310 test cases on CodeProjectEval, representing an improvement of roughly tenfold compared to the baselines.

Paper Structure

This paper contains 34 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Task input and output comparison between function-level and project-level code generation.
  • Figure 2: Detailed illustration of Semantic Software Architecture Tree. The left part shows the hierarchical organization of a repository, and the right part provides detailed examples of the elements contained in each type of node.
  • Figure 3: The workflow of ProjectGen. It takes user requirements as input and follows three sequential stages: architecture design, skeleton generation, and code filling, where the output of each stage serves as the input for the subsequent stage.
  • Figure 4: Comparison of project size between ground truth and code generated by ProjectGen using DeepSeek-V3.
  • Figure 5: Comparison of error types in code generated by ProjectGen on DevBench and CodeProjectEval.
  • ...and 1 more figure
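
To make the ideas behind Figures 2 and 3 concrete, the following is a minimal sketch of a hierarchical architecture tree with a semantic description on each node, passed through three sequential stages. It is an illustration only, not the authors' implementation: the `Node` class, its `kind` values, and the functions `design_architecture`, `generate_skeleton`, and `fill_code` are hypothetical names, and each stage is a deterministic stand-in for what the paper describes as LLM-backed agents with iterative refinement.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """One node of a hypothetical architecture tree (not the paper's exact schema)."""
    name: str
    kind: str  # assumed kinds: "directory", "file", "function"
    description: str = ""  # natural-language summary attached to the node
    children: List["Node"] = field(default_factory=list)

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, "Node"]]:
        """Yield (path, node) pairs in depth-first order."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path, self
        for child in self.children:
            yield from child.walk(path)


def design_architecture(requirements: str) -> Node:
    """Stage 1 stand-in: a real agent would derive the tree from the requirements."""
    add_fn = Node("add", "function", "Return the sum of two numbers.")
    core = Node("core.py", "file", "Arithmetic helpers.", [add_fn])
    return Node("calc", "directory", requirements, [core])


def generate_skeleton(tree: Node) -> Dict[str, str]:
    """Stage 2 stand-in: emit one stub source file per file node."""
    files: Dict[str, str] = {}
    for path, node in tree.walk():
        if node.kind != "file":
            continue
        lines = [f'"""{node.description}"""']
        for fn in node.children:
            lines += [
                "",
                f"def {fn.name}(*args):",
                f'    """{fn.description}"""',
                "    raise NotImplementedError",
            ]
        files[path] = "\n".join(lines) + "\n"
    return files


def fill_code(skeleton: Dict[str, str]) -> Dict[str, str]:
    """Stage 3 stand-in: replace each stub body with a placeholder implementation."""
    return {
        path: source.replace("raise NotImplementedError", "return sum(args)")
        for path, source in skeleton.items()
    }


# Each stage consumes the output of the previous one.
tree = design_architecture("A small library that adds numbers")
skeleton = generate_skeleton(tree)
project = fill_code(skeleton)
```

In this sketch the tree is the only artifact shared between stages, which mirrors the role the abstract assigns to SSAT as a bridge between requirements and source code. The refinement loops and memory-based context management described in the abstract are deliberately left out.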