Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Dylan Zhang, Justin Wang, Tianran Sun
TL;DR
This work tackles the data scarcity barrier in proof-oriented programming by proposing a data-centric post-training approach that synthesizes function-level and project-level tasks in F*. PoPilot, a 14B parameter model, is trained on diverse synthetic data plus cross-language instruction-tuning to learn both proof synthesis and repair. The approach yields strong project-level proof generation and repair performance, surpassing GPT-4o by 64% in generation and enabling up to 54% improvement in GPT-4o via repair, demonstrating the value of synthetic and diversified data for low-resource formal verification. The results suggest PoPilot can serve as a practical verification assistant and as a debugging companion for larger models, advancing the adoption of formal verification in real-world software development.
Abstract
Existing LMs struggle with proof-oriented programming due to data scarcity, which manifest in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process when performing proof-oriented programming. We present the first on synthetic data augmentation for project level proof oriented programming for both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language; incorporating diverse coding data for reasoning capability elicitation and creating new proofs and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperforms GPT-4o in project-level proof-oriented programming by 64% relative margin, and can improve GPT-4o's performance by 54% by repairing its outputs over GPT-4o's self-repair.
