Operational Robustness of LLMs on Code Generation

Debalina Ghosh Paul; Hong Zhu; Ian Bayley

TL;DR

The paper tackles the problem of evaluating how robust large language models are when generating code, focusing on micro-level sensitivity to variations in natural-language task descriptions. It introduces scenario domain analysis, a formal framework that defines a safe zone around a seed task description and uses paraphrase-based mutants to probe the boundary of correct code generation, quantified by the minimal distance to failure $\Delta_{D^s}(t,M)$ and the maximal safe distance $\nabla_{D^s}(t,M)$. By leveraging datamorphic testing, paraphrase generation with word-embedding neighborhoods, and multiple distance metrics (e.g., RoBERTa-based semantic similarity, Levenshtein, BLEU/ROUGE), the approach estimates the robustness metrics $\rho^{o}$ and $\rho^{*}$ via $R^{o}$ and $R^{*}$, and uses them to compare four state-of-the-art LLMs (Gemini-Pro, Codex, Llama2, Falcon 7B). Experiments on 900 coding tasks from the ScenEval benchmark show that robustness degrades with task complexity and with more advanced topics, and demonstrate the method's efficiency (fewer than 20 LLM queries per task) and flexibility across similarity metrics. The work provides a practical, scalable framework for micro robustness assessment in code generation, with implications for tool selection and reliability in software engineering workflows. Future work would extend the methodology to other programming languages and code-quality dimensions (e.g., code smells).

Abstract

It is now common practice in software development for large language models (LLMs) to be used to generate program code. It is desirable to evaluate the robustness of LLMs for this usage. This paper is concerned in particular with how sensitive LLMs are to variations in descriptions of the coding tasks. However, existing techniques for evaluating this robustness are unsuitable for code generation because the input data space of natural language descriptions is discrete. To address this problem, we propose a robustness evaluation method called scenario domain analysis, which aims to find the expected minimal change in the natural language descriptions of coding tasks that would cause the LLMs to produce incorrect outputs. We have formally proved the theoretical properties of the method and also conducted extensive experiments to evaluate the robustness of four state-of-the-art LLMs: Gemini-pro, Codex, Llama2 and Falcon 7B, and have found that we are able to rank these with confidence from best to worst. Moreover, we have also studied how robustness varies in different scenarios, including the variations with the topic of the coding task and with the complexity of its sample solution, and found that robustness is lower for more complex tasks and also lower for more advanced topics, such as multi-threading and data structures.
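The idea of a minimal failure-inducing change can be illustrated with a small sketch. It is not the paper's algorithm: it simply scans a fixed, finite list of paraphrased descriptions, measures each one's Levenshtein distance to the seed description, and reports the smallest distance at which a failure is observed together with the largest distance among passing mutants that lie closer than that failure. The names `failure_boundary` and `toy_fails` are hypothetical, the reading of "maximal safe distance" used here is an assumption, and `toy_fails` is only a stand-in for querying an LLM and testing the generated code.

```python
from typing import Callable, Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def failure_boundary(
    seed: str,
    mutants: Iterable[str],
    fails: Callable[[str], bool],
    distance: Callable[[str, str], float] = levenshtein,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (minimal failure distance, maximal safe distance) over mutants.

    Assumption: the maximal safe distance is taken to be the largest distance
    of a passing mutant that is strictly closer to the seed than the nearest
    failing mutant. Either value is None when no such mutant exists.
    """
    scored = [(distance(m, seed), fails(m)) for m in mutants]
    failing = [d for d, failed in scored if failed]
    min_fail = min(failing) if failing else None
    safe = [d for d, failed in scored
            if not failed and (min_fail is None or d < min_fail)]
    max_safe = max(safe) if safe else None
    return min_fail, max_safe


def toy_fails(description: str) -> bool:
    """Hypothetical stand-in for 'the LLM's code for this description fails'."""
    return "sum" not in description


seed = "return the sum of a list"
mutants = [
    "return the sum of a list",
    "return the sum of the list",
    "return the total of a list",
    "give back the total of every list",
]
min_fail, max_safe = failure_boundary(seed, mutants, toy_fails)
print("minimal failure distance:", min_fail)
print("maximal safe distance:", max_safe)
```

Any other distance, such as an embedding-based semantic dissimilarity, could be passed through the `distance` parameter, which mirrors the flexibility across similarity metrics that the TL;DR mentions.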

Paper Structure (49 sections, 14 theorems, 77 equations, 7 figures, 11 tables, 2 algorithms)

Introduction
Related Work
Adversarial Robustness
White-Box Generation of AEs
Black-Box Generation of AEs for Computer Vision
Black-Box Generation of AEs for NLP
Metrics of Adversarial Robustness
Operational Robustness
Identification of Test Scenarios
Construction of Test Dataset
Evaluation of ML Model's Performance
Metrics for Evaluation of Operational Robustness
The Challenges And Our Approach
Generative Nature of the Task
Discrete and Sparse Data Space
...and 34 more sections

Key Result

Lemma 1

$\forall x \in D^s. \left(\|x,t\| < \Delta_{D^s}(t,M) \Rightarrow \neg Fails^M_t(x)\right)$. ∎
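A brief justification sketch, under an assumption: this page does not reproduce the formal definition of $\Delta_{D^s}(t,M)$, so suppose, in line with the phrase "minimal distance to failure", that it is the smallest distance from the seed $t$ to any failing description in the scenario domain $D^s$:

$$
\Delta_{D^s}(t,M) = \min \left\{ \|x,t\| \;:\; x \in D^s \wedge Fails^M_t(x) \right\}
$$

Under that reading, every failing $x \in D^s$ belongs to the set being minimised, so

$$
Fails^M_t(x) \Rightarrow \|x,t\| \geq \Delta_{D^s}(t,M),
$$

and the lemma is the contrapositive of this implication: any $x$ with $\|x,t\| < \Delta_{D^s}(t,M)$ cannot fail.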

Figures (7)

Figure 1: Structure of Test System for Evaluation of Robustness
Figure 2: The Experiment Platform
Figure 3: Comparison of LLMs on Robustness of Code Generation
Figure 4: Variation of Robustness with Cyclomatic Complexity
Figure 5: Variation of Robustness with Topics
...and 2 more figures

Theorems & Definitions (21)

Definition 1
Lemma 1
Lemma 2
Definition 2
Definition 3
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Theorem 1
...and 11 more
