Table of Contents
Fetching ...

Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy

TL;DR

This work tackles AI-assisted synthesis of proof-oriented programs in the dependently typed language F* by releasing FStarDataSet, a large corpus of SMT-assisted programs and proofs drawn from production-grade projects. It formalizes a type-directed synthesis benchmark where the goal is to generate a program that satisfies a given type and evaluates retrieval-augmented prompts with fine-tuned small models versus large LLMs. The results show that fine-tuned smaller models can match or outperform large models at a fraction of the cost, especially when aided by retrieval and premise selection, and that prompt design and token budgets materially affect performance. The dataset and findings advance trustworthy AI programming and motivate further work on specification formulation, modular proof decomposition, and broader language coverage.

Abstract

Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem producing a definition given a formal specification expressed as an F* type. We provide a program fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54k top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.

Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

TL;DR

This work tackles AI-assisted synthesis of proof-oriented programs in the dependently typed language F* by releasing FStarDataSet, a large corpus of SMT-assisted programs and proofs drawn from production-grade projects. It formalizes a type-directed synthesis benchmark where the goal is to generate a program that satisfies a given type and evaluates retrieval-augmented prompts with fine-tuned small models versus large LLMs. The results show that fine-tuned smaller models can match or outperform large models at a fraction of the cost, especially when aided by retrieval and premise selection, and that prompt design and token budgets materially affect performance. The dataset and findings advance trustworthy AI programming and motivate further work on specification formulation, modular proof decomposition, and broader language coverage.

Abstract

Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem producing a definition given a formal specification expressed as an F* type. We provide a program fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54k top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
Paper Structure (19 sections, 7 figures, 8 tables)

This paper contains 19 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of our experimental approach for synthesizing F$^\star$ definitions.
  • Figure 2: Venn Diagram showing intersection of examples from Cross-Project examples solved by different models. The GPT-* performance in Fig. (b) are from prompting respective models.
  • Figure 3: Verify@10 across different types of examples for different models.
  • Figure 4: Different errors spawned from the generated definitions by different models, presented as percentage of all errors.
  • Figure 5: Normalized Levenshtein distance between verified solutions and ground truth across different classes.
  • ...and 2 more figures