Assured LLM-Based Software Engineering
Nadia Alshahwan, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang
TL;DR
The paper argues for Assured LLM-Based Software Engineering (Assured LLMSE), a generate-and-test paradigm that uses semantic filters to ensure that LLM-generated code preserves the properties of the original code while achieving verifiable improvements. Because of the need for assurances, it applies primarily in offline settings. The paper frames this as a Genetic Improvement problem in which LLMs generate candidate changes and Search-Based Software Engineering (SBSE) guides prompt optimization, yielding both improved code and an evolving prompting strategy. The discussion covers local versus global performance optimization and examines refactoring and debugging use cases in detail, all underpinned by automated regression testing as the assurance oracle. The work identifies open research directions: migrating offline assurances online, creating efficient filters and prompting languages, and developing scalable, domain-aware search strategies that make Assured LLMSE practical in real-world software engineering. Overall, the approach promises verifiable, autonomous code improvement with structured feedback loops between testing, prompting, and LLM reasoning, enabling safer deployment of AI-assisted software engineering.
Abstract
In this paper we address the following question: How can we use Large Language Models (LLMs) to improve code independently of a human, while ensuring that the improved code
- does not regress the properties of the original code?
- improves the original in a verifiable and measurable way?
To address this question, we advocate Assured LLM-Based Software Engineering (Assured LLMSE), a generate-and-test approach inspired by Genetic Improvement. Assured LLMSE applies a series of semantic filters that discard code that fails to meet these twin guarantees. This overcomes the potential problem of LLMs' propensity to hallucinate and allows us to generate code using LLMs, independently of any human. The human plays only the role of final code reviewer, as they would for code written by other human engineers. This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
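The generate-and-test idea with semantic filters can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`builds`, `passes_regression`, `lines_executed`, `assured_candidates`, and the function name `total`) is hypothetical, the hard-coded candidate strings stand in for LLM output, and counting executed lines is an assumed deterministic stand-in for a real performance measurement.

```python
import sys
from typing import Callable, List, Sequence, Tuple

FUNC_NAME = "total"  # hypothetical convention: every candidate defines this function


def builds(src: str) -> bool:
    """Filter 1: discard candidates that do not compile."""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def load(src: str) -> Callable[[int], int]:
    """Execute candidate source and return its function (sandbox this in practice)."""
    namespace: dict = {}
    exec(src, namespace)
    return namespace[FUNC_NAME]


def passes_regression(src: str, cases: Sequence[Tuple[int, int]]) -> bool:
    """Filter 2: the candidate must reproduce the original's observed behaviour."""
    try:
        fn = load(src)
        return all(fn(x) == expected for x, expected in cases)
    except Exception:
        return False


def lines_executed(fn: Callable[[int], int], arg: int) -> int:
    """Count 'line' trace events during one call; a proxy for cost, not real timing."""
    count = 0

    def tracer(frame, event, _arg):
        nonlocal count
        if event == "line":
            count += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        fn(arg)
    finally:
        sys.settrace(previous)
    return count


def assured_candidates(original_src: str, candidates: Sequence[str],
                       inputs: Sequence[int], probe: int) -> List[str]:
    """Keep only candidates that build, do not regress, and measurably improve."""
    original = load(original_src)
    cases = [(x, original(x)) for x in inputs]  # regression oracle from the original
    baseline = lines_executed(original, probe)
    survivors = []
    for src in candidates:
        if not builds(src):
            continue
        if not passes_regression(src, cases):
            continue
        if lines_executed(load(src), probe) >= baseline:  # Filter 3: must improve
            continue
        survivors.append(src)
    return survivors


ORIGINAL = "def total(n):\n    s = 0\n    for i in range(n):\n        s += i\n    return s\n"
BROKEN = "def total(n) return n\n"                          # does not compile
OFF_BY_ONE = "def total(n):\n    return n * (n + 1) // 2\n"  # regresses behaviour
SLOWER = ("def total(n):\n    s = 0\n    for i in range(n):\n"
          "        s += i\n        s += 0\n    return s\n")  # correct, not improved
CLOSED_FORM = "def total(n):\n    return n * (n - 1) // 2\n"  # correct and cheaper

SURVIVORS = assured_candidates(
    ORIGINAL, [BROKEN, OFF_BY_ONE, SLOWER, CLOSED_FORM], inputs=[0, 1, 5, 10], probe=100
)
```

In this sketch only the closed-form candidate reaches `SURVIVORS`, so a human reviewer would see only code that has passed all three filters. A real pipeline would run candidates in isolation and use a much richer regression suite.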
