Effective LLM-Driven Code Generation with Pythoness

Kyla H. Levin, Kyle Gwilt, Emery D. Berger, Stephen N. Freund

TL;DR

Pythoness introduces an embedded DSL that lets developers express desired behavior through natural-language descriptions and formal or informal tests, then uses an LLM to generate code that passes those tests. It adds validation and repair loops (including Hypothesis-based property testing) and caches validated code to ensure robustness across runs. Through a LeetCode example, the authors demonstrate how tests dramatically improve code quality compared to prompts alone, and outline future work on run-time checks, performance specifications, and cross-function correctness. This approach aims to combine the speed of LLM-driven generation with rigorous guardrails to produce reliable, maintainable code in real-world development workflows.

Abstract

The advent of large language models (LLMs) has paved the way for a new era of programming tools with both significant capabilities and risks, as the generated code lacks guarantees of correctness and reliability. Developers using LLMs currently face the difficult task of optimizing, integrating, and maintaining code generated by AI. We propose an embedded domain-specific language (DSL), Pythoness, to address those challenges. In Pythoness, developers program with LLMs at a higher level of abstraction. Rather than interacting directly with generated code, developers using Pythoness operate at the level of behavioral specifications when writing functions, classes, or an entire program. These specifications can take the form of unit tests and property-based tests, which may be expressed formally or in natural language. Guided by these specifications, Pythoness generates code that both passes the tests and can be continuously checked during execution. We posit that the Pythoness approach lets developers harness the full potential of LLMs for code generation while substantially mitigating their inherent risks. We describe our current prototype implementation of Pythoness and demonstrate that it can successfully leverage a combination of tests and code generation to yield higher quality code than specifications alone.

Paper Structure

This paper contains 9 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Pythoness lets developers integrate high-quality LLM-generated code with manually written code. On the automated side of the space, LLM conversations (e.g., ChatGPT) produce code with no guarantees of quality. On the traditionally written side, developers also produce code of varying quality. Pythoness combines both approaches to produce validated, high-quality code.
  • Figure 2: Overview of the software architecture of our Pythoness prototype. When a Pythoness-decorated function is called for the first time, the prototype generates code via an LLM, checks the code against the provided tests, and caches validated code in a database for future use. If the code fails the tests or compilation, the prototype attempts to regenerate the code until it either passes all tests or reports a failure.
  • Figure 3: The Pythoness header used to generate the code in Figure 4(b). For brevity, the docstring paraphrases the full #3350 problem description on LeetCode.
  • Figure 4: A comparison of the code produced by Pythoness with and without unit tests. Without any tests, Pythoness produces the noticeably faulty code in Figure 4(a), which passes only 469 of the 1,111 tests on LeetCode. When provided with a set of unit tests, Pythoness generates the improved code in Figure 4(b), which passes all the LeetCode tests.