SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents

Feng Lin, Dong Jae Kim, Tse-Hsun Chen

TL;DR

This work introduces FlowGen, an agent-based framework that emulates established software development processes by assigning LLM agents to roles covering requirements, design, implementation, and testing. By implementing Waterfall, TDD, and Scrum variants with self-refinement and cross-role feedback, FlowGen demonstrates substantial gains in code correctness (Pass@1) and code quality on four Python benchmarks, with FlowGenScrum delivering the strongest and most stable results. The study also shows that design, code review, and testing reduce code smells and enhance exception handling, and that integrating CodeT can further improve performance. These findings suggest that structured software-process practices can meaningfully enhance LLM-driven code generation, informing future research on human-in-the-loop collaboration and benchmark design in AI-assisted software engineering.

Abstract

Software process models are essential to facilitate collaboration and communication among software teams to solve complex development tasks. Inspired by these software engineering practices, we present FlowGen, a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We emulate three process models, FlowGenWaterfall, FlowGenTDD, and FlowGenScrum, by assigning LLM agents to embody roles (i.e., requirement engineer, architect, developer, tester, and scrum master) that correspond to everyday development activities and by organizing their communication patterns. The agents work collaboratively using chain-of-thought and prompt composition with continuous self-refinement to improve the code quality. We use GPT3.5 as our underlying LLM and several baselines (RawGPT, CodeT, Reflexion) to evaluate code generation on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Our findings show that FlowGenScrum excels compared to the other process models, achieving a Pass@1 of 75.2, 65.5, 82.5, and 56.7 on HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively (an average improvement of 15% over RawGPT). Among state-of-the-art techniques, FlowGenScrum achieves a higher Pass@1 on MBPP than CodeT, with both outperforming Reflexion. Notably, integrating CodeT into FlowGenScrum resulted in statistically significant improvements, achieving the highest Pass@1 scores. Our analysis also reveals that the development activities impacted code smells and exception handling differently, with design and code review adding more exception handling and reducing code smells. Finally, FlowGen models maintain stable Pass@1 scores across GPT3.5 versions and temperature values, highlighting the effectiveness of software process models in enhancing the quality and stability of LLM-generated code.
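
The results above are reported as Pass@1. For reference, here is a minimal sketch of how this metric is commonly computed, assuming each benchmark problem yields pass/fail outcomes from its test suite. The function names `pass_at_k` and `benchmark_pass_at_1` are hypothetical, and the abstract does not state which evaluation script the authors used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Commonly used unbiased pass@k estimator for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that pass every test
    k: sample budget being evaluated
    Returns 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when n - c < k, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: list[bool]) -> float:
    """Pass@1 as a percentage, assuming one generated solution per problem.

    passed[i] is True when the single solution for problem i passes
    all of that problem's tests.
    """
    if not passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(passed) / len(passed)


if __name__ == "__main__":
    # Illustrative outcomes only, not data from the paper.
    print(benchmark_pass_at_1([True, True, False, True]))
```

With a single sample per problem (n = k = 1), the estimator reduces to 1 for a passing solution and 0 for a failing one, so the benchmark score is simply the share of problems solved on the first attempt.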

Paper Structure (12 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: An overview of FlowGenWaterfall, FlowGenTDD, and FlowGenScrum.
  • Figure 2: Pass@1 across GPT3.5 versions.
  • Figure 3: Pass@1 across temperature values.