RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation

Feng Lin, Dong Jae Kim, Zhenhao Li, Jinqiu Yang, Tse-Hsun Chen

TL;DR

RobuNFR introduces an automated framework to evaluate the robustness of LLMs when generating code with non-functional requirements (NFRs). It defines four NFR dimensions (design, readability, reliability, and performance) and three evaluation methodologies: prompt variations, regression testing, and NFR-aware code generation workflows. The study demonstrates that incorporating NFRs can markedly reduce functional correctness (Pass@1) and increase result variability, while improving NFR-related quality in a model-dependent manner; model updates and workflow choices expose trade-offs and robustness gaps. The work provides replication data and emphasizes continuous quality assurance for deploying LLM-based development tools.

Abstract

When using LLMs to address Non-Functional Requirements (NFRs), developers may behave differently (e.g., expressing the same NFR in different words). Robust LLMs should output consistent results across these variations; however, this aspect remains underexplored. We propose RobuNFR for evaluating the robustness of LLMs in NFR-aware code generation across four NFR dimensions: design, readability, reliability, and performance, using three methodologies: prompt variation, regression testing, and diverse workflows. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when considering NFRs in code generation. Specifically, under prompt variation, including NFRs leads to a decrease in Pass@1 by up to 39 percent and an increase in the standard deviation from 0.48 to 2.48 compared to the baseline without NFRs (i.e., Function-Only). While incorporating NFRs generally improves overall NFR metrics, it also results in higher prompt sensitivity. In regression settings, some LLMs exhibit differences across versions, with improvements in one aspect (e.g., reduced code smells) often accompanied by regressions in another (e.g., decreased correctness), revealing inconsistencies that challenge their robustness. When varying workflows, the tested LLMs show significantly different NFR-aware code generation capabilities between two workflows: (1) integrating NFRs and functional requirements into the initial prompt and (2) enhancing Function-Only-generated code with the same NFR.
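The prompt-variation measurement described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes Pass@1 is computed per prompt variant as the percentage of problems whose single generated solution passes all tests, and that sensitivity is summarized as the standard deviation of those percentages across variants. The function names and the toy data are hypothetical, and the use of the sample (rather than population) standard deviation is also an assumption.

```python
from statistics import mean, stdev


def pass_at_1(results):
    """Percentage of problems whose single generated sample passed all tests.

    `results` is a list of booleans, one per benchmark problem (hypothetical format).
    """
    if not results:
        raise ValueError("results must be non-empty")
    return 100.0 * sum(results) / len(results)


def summarize_variants(results_by_variant):
    """Compute Pass@1 per prompt variant plus the mean and spread across variants."""
    scores = {name: pass_at_1(r) for name, r in results_by_variant.items()}
    values = list(scores.values())
    # Sample standard deviation; this choice is an assumption of the sketch.
    spread = stdev(values) if len(values) > 1 else 0.0
    return {"scores": scores, "mean": mean(values), "std": spread}


# Toy data: three paraphrases of the same NFR, four problems each (made up).
toy = {
    "variant_a": [True, True, False, True],
    "variant_b": [True, False, False, True],
    "variant_c": [True, True, True, True],
}
summary = summarize_variants(toy)
print(summary)
```

A larger `std` for NFR-aware prompts than for Function-Only prompts on the same benchmark would correspond to the higher prompt sensitivity the abstract reports.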

Paper Structure

This paper contains 18 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Simplified examples of Generated Code: One Without Performance Considerations, and Two With Performance Considerations Using Different Prompts.
  • Figure 2: Overview of RobuNFR. RobuNFR leverages three methodologies to evaluate the code generation capabilities of an LLM across NFR dimensions using various code benchmarks, aiming to reveal potential robustness issues in the LLM under test.
  • Figure 3: RobuNFR defines three workflows as part of its NFR-aware code generation evaluation methodology. These workflows include Function-Only code generation, NFR-Integrated code generation, and NFR-Enhanced code refinement. We compare the functional and non-functional quality of the generated code across these workflows.
  • Figure 4: A simplified example of a prompt template for NFR-Aware code generation workflows.
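
The three workflows named in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the template from Figure 4: the prompt wording, the function names, and the `generate` callable (a stand-in for whatever LLM client is used) are all hypothetical.

```python
from typing import Callable, Dict


def function_only_prompt(task: str) -> str:
    """Prompt containing only the functional requirement (hypothetical wording)."""
    return f"Write code for the following task.\n\nTask:\n{task}\n"


def nfr_integrated_prompt(task: str, nfr: str) -> str:
    """Functional requirement and NFR combined in the initial prompt."""
    return (
        "Write code for the following task.\n\n"
        f"Task:\n{task}\n\n"
        f"Non-functional requirement:\n{nfr}\n"
    )


def nfr_enhanced_prompt(code: str, nfr: str) -> str:
    """Ask the model to refine previously generated code with respect to the NFR."""
    return (
        "Improve the following code without changing its behavior.\n\n"
        f"Code:\n{code}\n\n"
        f"Non-functional requirement:\n{nfr}\n"
    )


def run_workflows(task: str, nfr: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run all three workflows with the same task and NFR.

    `generate` maps a prompt string to generated code; it is a placeholder
    for a real model call and is an assumption of this sketch.
    """
    base_code = generate(function_only_prompt(task))
    integrated = generate(nfr_integrated_prompt(task, nfr))
    # NFR-Enhanced starts from the Function-Only output, not from scratch.
    enhanced = generate(nfr_enhanced_prompt(base_code, nfr))
    return {
        "function_only": base_code,
        "nfr_integrated": integrated,
        "nfr_enhanced": enhanced,
    }


# Stub model that tags each prompt with its length, for demonstration only.
def fake_generate(prompt: str) -> str:
    return f"# generated from a prompt of {len(prompt)} characters"


outputs = run_workflows("Sort a list of integers.", "Optimize for performance.", fake_generate)
print(outputs)
```

Each of the three outputs would then be scored with the same functional tests and NFR metrics, so that the NFR-Integrated and NFR-Enhanced results can be compared against the Function-Only baseline.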