RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation

Feng Lin; Dong Jae Kim; Zhenhao Li; Jinqiu Yang; Tse-Hsun Chen

TL;DR

RobuNFR introduces an automated framework to evaluate the robustness of LLMs when generating code with non-functional requirements (NFRs). It defines four NFR dimensions (design, readability, reliability, and performance) and three evaluation methodologies: prompt variations, regression testing, and NFR-aware code generation workflows. The study demonstrates that incorporating NFRs can markedly reduce functional correctness (Pass@1) and increase result variability, while improving NFR-related quality in a model-dependent manner; model updates and workflow choices expose trade-offs and robustness gaps. The work provides replication data and emphasizes continuous quality assurance for deploying LLM-based development tools.

Abstract

When using LLMs to address Non-Functional Requirements (NFRs), developers may behave differently (e.g., expressing the same NFR in different words). Robust LLMs should output consistent results across these variations; however, this aspect remains underexplored. We propose RobuNFR for evaluating the robustness of LLMs in NFR-aware code generation across four NFR dimensions (design, readability, reliability, and performance) using three methodologies: prompt variation, regression testing, and diverse workflows. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when NFRs are considered in code generation. Specifically, under prompt variation, including NFRs decreases Pass@1 by up to 39 percent and increases the standard deviation from 0.48 to 2.48 compared to the baseline without NFRs (i.e., Function-Only). While incorporating NFRs generally improves overall NFR metrics, it also results in higher prompt sensitivity. In regression settings, some LLMs exhibit differences across versions, with improvements in one aspect (e.g., reduced code smells) often accompanied by regressions in another (e.g., decreased correctness), revealing inconsistencies that challenge their robustness. When varying workflows, the tested LLMs show significantly different NFR-aware code generation capabilities between two workflows: (1) integrating NFRs and functional requirements into the initial prompt and (2) enhancing Function-Only-generated code with the same NFR.
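To make the prompt-variation measurement concrete, the following is a minimal sketch of how Pass@1 and its spread across prompt variants could be computed. It is not the authors' implementation. It assumes one generated solution per problem (so Pass@1 reduces to the share of problems whose solution passes all tests), reports Pass@1 in percentage points, and uses the population standard deviation; the abstract does not specify these details. The names `pass_at_1`, `summarize`, and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def pass_at_1(results):
    """Pass@1 in percent, assuming one generated solution per problem.

    `results` is a list of booleans: True if the solution for that
    problem passed all of its functional tests.
    """
    return 100.0 * sum(results) / len(results)


def summarize(variant_results):
    """Return (mean, standard deviation) of Pass@1 across prompt variants.

    `variant_results` maps a prompt-variant name to its per-problem
    pass/fail list. Population standard deviation is an assumption here.
    """
    scores = [pass_at_1(r) for r in variant_results.values()]
    return mean(scores), pstdev(scores)


# Hypothetical pass/fail outcomes for three rewordings of the same prompt.
function_only = {
    "variant_a": [True, True, True, False],
    "variant_b": [True, True, True, False],
    "variant_c": [True, True, False, False],
}

avg, spread = summarize(function_only)
print(f"mean Pass@1 = {avg:.2f}, std = {spread:.2f}")
```

Running the same summary once for Function-Only prompts and once for NFR-augmented prompts gives the two quantities the abstract compares: the drop in mean Pass@1 and the change in standard deviation.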
