ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models

Hojae Han, Jaejin Kim, Jaeseok Yoo, Youngwon Lee, Seung-won Hwang

TL;DR

ArchCode tackles the problem of generating code that satisfies both functional requirements (FRs) and non-functional requirements (NFRs) described in natural language. It uses in-context learning to extract expressed and inferred FRs and NFRs, then generates code and per-requirement test cases in parallel, and uses a weighted, test-case-driven ranking to select the solutions that best meet the requirements. The work introduces HumanEval-NFR to evaluate non-functional aspects and demonstrates state-of-the-art or near-state-of-the-art performance on HumanEval and CodeContests, while requiring significantly fewer generated test cases than prior methods. Across diverse settings, ArchCode shows robustness, efficiency, and generalizability, including extensions to open-source LLMs and Java, making it a practical approach for requirement-aware AI-assisted code generation.

Abstract

This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional (i.e., achieving expected behavior for inputs) and non-functional (e.g., time/space performance, robustness, maintainability) requirements. However, textual descriptions may express requirements verbosely or omit some of them entirely. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize requirements observed in descriptions and to extrapolate unexpressed requirements from them. ARCHCODE generates requirements from given descriptions, then conditions on them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing code snippets to be ranked by how well their execution results comply with the requirements. Results on public benchmarks show that ARCHCODE better satisfies functional requirements, significantly improving Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first benchmark for evaluating how well LLM-generated code satisfies non-functional requirements, and use it to demonstrate ARCHCODE's superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.
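
The ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ReqTest` type, the `rank` function, the requirement labels, and the weight values are all hypothetical, and the scoring rule (a weighted sum of passed tests) is only one plausible reading of "weighted, test-case-driven ranking". In the real framework both the candidate solutions and the tests would be generated by an LLM; here they are hand-written stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A candidate solution: here, a function computing the mean of a sequence.
Solution = Callable[[Sequence[float]], Optional[float]]


@dataclass
class ReqTest:
    """One test case tied to exactly one requirement (hypothetical type)."""
    requirement: str                      # assumed label, e.g. "functional"
    check: Callable[[Solution], bool]     # True if the candidate complies


def passes(test: ReqTest, solution: Solution) -> bool:
    """Run a single test; any raised exception counts as a failure."""
    try:
        return bool(test.check(solution))
    except Exception:
        return False


def rank(
    solutions: Dict[str, Solution],
    tests: List[ReqTest],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Score each candidate by the summed weights of the tests it passes,
    then sort from best to worst. The weighting scheme is an assumption."""
    scored = [
        (name, sum(weights.get(t.requirement, 1.0) for t in tests if passes(t, fn)))
        for name, fn in solutions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two hand-written stand-ins for generated code snippets.
def naive_mean(xs):
    return sum(xs) / len(xs)              # raises ZeroDivisionError on []


def robust_mean(xs):
    return sum(xs) / len(xs) if xs else None


tests = [
    ReqTest("functional", lambda f: f([1, 2, 3]) == 2),
    ReqTest("robustness", lambda f: f([]) is None),
]
weights = {"functional": 2.0, "robustness": 1.0}   # illustrative values only

ranking = rank({"naive_mean": naive_mean, "robust_mean": robust_mean}, tests, weights)
print(ranking)
```

Because each test is tagged with the requirement it targets, a candidate that handles only the functional case is ranked below one that also satisfies the inferred robustness requirement, even though both produce the correct mean on ordinary input.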

Paper Structure (48 sections, 3 equations, 6 figures, 37 tables)

Figures (6)

  • Figure 1: The ArchCode framework infers the software requirements of a correct code solution for a given textual description, then conditions on them to generate code, as well as test cases for verification.
  • Figure 2: An illustrative example of code and test case generation. Existing approaches derive code and test cases directly from problem descriptions, often missing key requirements. ArchCode, in contrast, reformulates (underlined) and extrapolates (not underlined) requirements from these descriptions, then generates code and test cases to meet them comprehensively. Best viewed in color.
  • Figure 3: Overview of the ArchCode framework. Each color represents a subtype of software requirement. Underlined requirements are expressed in the problem descriptions, whereas the other requirements are inferred from the descriptions using the LLM's parametric knowledge. Best viewed in color.
  • Figure 4: Pass@1 versus the average number of test cases needed per problem on HumanEval. ArchCode ($\blackdiamond$) achieves the highest Pass@1 score with significantly fewer generated test cases. All values are obtained with GPT-3.5-Turbo. The values for MPSC ($\bullet$) and CodeT ($\blacktriangle$) are from Huang et al. (2023). Best viewed in color.
  • Figure 5: Pass@1 score of ArchCode for each requirement category in HumanEval-NFR. Using dedicated test cases for filtering consistently outperforms blindly using all test cases. Best viewed in color.
  • ...and 1 more figure
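
The figures above report Pass@1, and the abstract reports Pass@k more generally. This page does not say how the metric is computed, so the sketch below shows only the widely used unbiased estimator introduced alongside HumanEval: given n sampled solutions of which c are correct, Pass@k = 1 − C(n−c, k) / C(n, k). Methods that rank candidates first may instead check whether any of the top-k ranked solutions passes; which variant underlies the reported numbers is an open assumption here. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: number of sampled solutions
    c: number of those that pass all hidden tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 correct. With k = 1 the estimate equals c / n.
print(pass_at_k(10, 3, 1))
```

A benchmark-level score is then the average of this per-problem value over all problems.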