ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis
Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Soonho Kong
TL;DR
ATLAS is a scalable pipeline that automatically generates formally verified Dafny programs, addressing the scarcity of verified training data by producing contracts, implementations, and proofs at scale. By decomposing synthesis into specialized tasks and applying soundness and completeness checks, the approach yields a multi-task training dataset (~19K examples from 2.7K verified programs), which is used to fine-tune Qwen 2.5 7B Coder. Fine-tuning yields gains of +23 percentage points on DafnyBench and +50 percentage points on DafnySynthesis, showing that synthetic verified code can strengthen the formal verification capabilities of LLMs. The work also provides a structured framework for evaluation and for future exploration of spec-focused synthesis, automated proofs, and agentic verification tools.
Abstract
Large language models have shown potential for program verification, but progress is hindered by the scarcity of verified code for training. We present ATLAS, an automated pipeline that synthesizes verified programs at scale to address this data bottleneck. ATLAS generates complete Dafny programs with specifications, implementations, and proofs, producing 2.7K verified programs from which we extract over 19K training examples (more than 7 per verified program) by decomposing the synthesis process into multiple specialized tasks. Fine-tuning Qwen 2.5 7B Coder on this dataset produces substantial gains: +23 percentage points on DafnyBench and +50 percentage points on DafnySynthesis. These results demonstrate that synthetic verified code can effectively enhance LLM capabilities for formal verification.
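To make "specifications, implementations, and proofs" concrete, the following is a minimal hand-written Dafny sketch of what a complete verified program can look like. It is not taken from the ATLAS dataset; the method name ArrayMax and its contract are illustrative assumptions. The requires/ensures clauses play the role of the specification, the method body is the implementation, and the loop invariants are the proof annotations the verifier needs.

```dafny
// Illustrative example only (not from the ATLAS dataset).
// Specification: the contract stated by requires/ensures.
method ArrayMax(a: array<int>) returns (m: int)
  requires a.Length > 0
  // m is an upper bound on every element ...
  ensures forall k :: 0 <= k < a.Length ==> a[k] <= m
  // ... and m actually occurs in the array.
  ensures exists k :: 0 <= k < a.Length && a[k] == m
{
  // Implementation: a linear scan that tracks the largest value seen so far.
  m := a[0];
  var i := 1;
  while i < a.Length
    // Proof annotations: invariants restating the contract for the prefix a[0..i).
    invariant 1 <= i <= a.Length
    invariant forall k :: 0 <= k < i ==> a[k] <= m
    invariant exists k :: 0 <= k < i && a[k] == m
  {
    if a[i] > m {
      m := a[i];
    }
    i := i + 1;
  }
}
```

A single program of this shape has separable parts (contract, body, invariants), which suggests how one verified program could give rise to several distinct training examples; the exact task decomposition used by ATLAS is not described in this section.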
