Operational Robustness of LLMs on Code Generation
Debalina Ghosh Paul, Hong Zhu, Ian Bayley
TL;DR
The paper tackles the problem of evaluating how robust large language models are when generating code, focusing on micro-level sensitivity to variations in natural-language task descriptions. It introduces scenario domain analysis, a formal framework that defines a safe zone around a seed task description and uses paraphrase-based mutants to probe the boundary of correct code generation, quantified by the minimal distance to failure $\Delta_{D^s}(t,M)$ and the maximal safe distance $\nabla_{D^s}(t,M)$. By leveraging datamorphic testing, paraphrase generation with word-embedding neighborhoods, and multiple distance metrics (e.g., RoBERTa-based semantic similarity, Levenshtein, BLEU/ROUGE), the approach estimates the robustness metrics $\rho^{o}$ and $\rho^{*}$ via $R^{o}$ and $R^{*}$ and uses them to compare four state-of-the-art LLMs (Gemini-Pro, Codex, Llama2, Falcon 7B). Experiments on 900 coding tasks from the ScenEval benchmark show that robustness degrades with task complexity and on advanced topics, and demonstrate the method’s efficiency (fewer than 20 LLM queries per task) and flexibility across similarity metrics. The work provides a practical, scalable framework for micro robustness assessment in code generation, with implications for tool selection and reliability in software engineering workflows. Future work would extend the methodology to other programming languages and code-quality dimensions (e.g., code smells).
Abstract
It is now common practice in software development for large language models (LLMs) to be used to generate program code. It is desirable to evaluate the robustness of LLMs for this usage. This paper is concerned in particular with how sensitive LLMs are to variations in the descriptions of coding tasks. However, existing techniques for evaluating this robustness are unsuitable for code generation because the input data space of natural language descriptions is discrete. To address this problem, we propose a robustness evaluation method called scenario domain analysis, which aims to find the expected minimal change in the natural language description of a coding task that would cause an LLM to produce incorrect output. We have formally proved the theoretical properties of the method and conducted extensive experiments to evaluate the robustness of four state-of-the-art LLMs: Gemini-Pro, Codex, Llama2 and Falcon 7B. We found that we are able to rank these with confidence from best to worst. Moreover, we have studied how robustness varies across scenarios, including variation with the topic of the coding task and with the complexity of its sample solution, and found that robustness is lower for more complex tasks and for more advanced topics, such as multi-threading and data structures.
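To make the idea of a "minimal change that causes incorrect output" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each paraphrased task description (a mutant) has already been sent to an LLM and its generated code judged as passing or failing, and it uses character-level Levenshtein distance as one possible distance metric. The names `MutantResult`, `distance_to_failure`, and `safe_distance` are hypothetical, and the two readings used here (the smallest distance among failing mutants, the largest distance among passing mutants) are simplifying assumptions rather than the formal definitions in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


@dataclass
class MutantResult:
    """Hypothetical record: one paraphrase and whether the generated code passed."""
    description: str
    passed: bool


def distance_to_failure(seed: str, results: List[MutantResult]) -> Optional[int]:
    """Assumed reading: smallest distance from the seed to any failing mutant."""
    failing = [levenshtein(seed, r.description) for r in results if not r.passed]
    return min(failing) if failing else None


def safe_distance(seed: str, results: List[MutantResult]) -> Optional[int]:
    """Assumed reading: largest distance from the seed to any passing mutant."""
    passing = [levenshtein(seed, r.description) for r in results if r.passed]
    return max(passing) if passing else None
```

Under these assumptions, a larger distance to failure for a given seed task suggests that the model tolerates larger rewordings before its output becomes incorrect. Any other distance function with the same signature, such as an embedding-based semantic distance, could be substituted for `levenshtein`.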
