Assured LLM-Based Software Engineering

Nadia Alshahwan, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang

TL;DR

The paper argues for Assured LLM-Based Software Engineering (Assured LLMSE), a generate-and-test paradigm that applies semantic filters so that LLM-generated code preserves the properties of the original code while improving on it in a verifiable way. It targets primarily offline settings, because of the need for assurances. The paper frames this as a Genetic Improvement problem in which LLMs generate candidate changes and Search-Based Software Engineering (SBSE) guides prompt optimization, yielding both improved code and an evolving prompting strategy. The discussion covers local versus global performance optimization and examines refactoring and debugging use cases in detail, all underpinned by automated regression testing as the assurance oracle. The work identifies open research directions: migrating offline assurances online, creating efficient filters and prompting languages, and developing scalable, domain-aware search strategies that make Assured LLMSE practical in real-world software engineering. Overall, the approach promises verifiable, autonomous code improvement with structured feedback loops between testing, prompting, and LLM reasoning, enabling safer deployment of AI-assisted software engineering.

Abstract

In this paper we address the following question: How can we use Large Language Models (LLMs) to improve code independently of a human, while ensuring that the improved code (1) does not regress the properties of the original code and (2) improves the original in a verifiable and measurable way? To address this question, we advocate Assured LLM-Based Software Engineering, a generate-and-test approach inspired by Genetic Improvement. Assured LLMSE applies a series of semantic filters that discard code that fails to meet these twin guarantees. This overcomes the potential problem of LLMs' propensity to hallucinate. It allows us to generate code using LLMs, independently of any human. The human plays the role only of final code reviewer, as they would do with code generated by other human engineers. This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.

Paper Structure (10 sections, 1 figure)

This paper contains 10 sections and 1 figure.

Figures (1)

  • Figure 1: Top level comparison between Assured and Non-Assured Large Language Model Software Engineering. In the assured mode, there is a whole infrastructure phase for implementing 'Assurance by Analysis and Manipulation'. This assurance phase pre-processes and post-processes the initial code produced by the language model, passing it through a set of filters. Code that passes through all these filters meets the measurable assurance guarantees denoted by the filters and is passed on to the code consumer, which may be a human or another automated tool. Code that fails any of the filters may undergo a follow-up repair process, and/or may occasion LLM re-prompting and prompt optimisation using SBSE. The repair and re-prompting steps are optional. In general, code that fails any of the filters (and cannot be repaired or re-prompted) will be discarded. By comparison, Non-Assured LLMSE simply passes the initial code generated in response to an LLM prompt directly to the code consumer, and offers no guarantee; the code may not even compile.
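
The assured flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `assured_generate`, `first_failure`, `scripted_llm`, and the three filters are hypothetical, the "LLM" is a scripted list of candidates, and the improvement metric (fewer AST nodes than the original) is a toy stand-in for a real measurable improvement. The optional repair step is left out for brevity; re-prompting is modelled by passing the name of the failed filter back to the generator.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A named filter maps candidate source code to True (pass) or False (fail).
Filter = Tuple[str, Callable[[str], bool]]

# Hypothetical original code that the LLM is asked to improve.
ORIGINAL = (
    "def total(xs):\n"
    "    result = 0\n"
    "    for x in xs:\n"
    "        result = result + x\n"
    "    return result\n"
)

# Regression oracle: (input, expected output) pairs reflecting the original's behaviour.
REGRESSION_CASES = [([], 0), ([1, 2, 3], 6), ([-4, 4], 0)]


def builds(code: str) -> bool:
    """Filter 1: the candidate must at least compile."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def passes_regression(code: str) -> bool:
    """Filter 2: the candidate must not regress the original's behaviour.
    A real system would run candidates in a sandbox rather than via exec."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return all(namespace["total"](list(xs)) == want for xs, want in REGRESSION_CASES)
    except Exception:
        return False


def improves(code: str) -> bool:
    """Filter 3: toy improvement metric (an assumption): fewer AST nodes than ORIGINAL."""
    def size(src: str) -> int:
        return sum(1 for _ in ast.walk(ast.parse(src)))
    return size(code) < size(ORIGINAL)


@dataclass
class Outcome:
    code: Optional[str]           # accepted candidate, or None if all were discarded
    attempts: int                 # number of candidates generated
    failed_filter: Optional[str]  # filter that rejected the last candidate, if any


def first_failure(code: str, filters: List[Filter]) -> Optional[str]:
    """Return the name of the first filter the code fails, or None if it passes all."""
    for name, check in filters:
        if not check(code):
            return name
    return None


def assured_generate(generate: Callable[[Optional[str]], str],
                     filters: List[Filter],
                     max_attempts: int = 3) -> Outcome:
    """Generate-and-test loop: only code that passes every filter reaches the consumer."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)              # (re-)prompt, informed by the last failure
        feedback = first_failure(candidate, filters)
        if feedback is None:
            return Outcome(candidate, attempt, None)
    return Outcome(None, max_attempts, feedback)    # discard: no candidate met the guarantees


def scripted_llm(candidates: List[str]) -> Callable[[Optional[str]], str]:
    """Stand-in for an LLM: returns pre-written candidates in order, ignoring feedback."""
    remaining = iter(candidates)
    return lambda feedback: next(remaining)


FILTERS: List[Filter] = [("builds", builds), ("regression", passes_regression), ("improves", improves)]
SCRIPTED = [
    "def total(xs) return sum(xs)\n",         # does not compile
    "def total(xs):\n    return max(xs)\n",   # compiles but regresses behaviour
    "def total(xs):\n    return sum(xs)\n",   # passes all filters
]

result = assured_generate(scripted_llm(SCRIPTED), FILTERS)
print(result)
```

Ordering the filters from cheapest to most expensive (build, then regression tests, then improvement measurement) lets the loop discard hallucinated candidates early. Non-Assured LLMSE would correspond to returning the first candidate unchecked, which in this sketch would not even compile.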

Theorems & Definitions (7)

  • Definition 1: LLM-Based Software Engineering (LLMSE)
  • Definition 2: LLM Application
  • Definition 3: LLM Consumer
  • Definition 4: Real Time
  • Definition 5: Online LLM-Based Software Engineering (Online LLMSE)
  • Definition 6: Offline LLMSE
  • Definition 7: Assured LLMSE