ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis

Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Soonho Kong

TL;DR

ATLAS is a scalable pipeline that automatically generates formally verified Dafny programs, addressing data scarcity by producing contracts, implementations, and proofs at scale. By decomposing synthesis into specialized tasks and applying soundness and completeness checks, it yields a multi-task training dataset (~19K examples from 2.7K verified programs). Fine-tuning Qwen 2.5 7B Coder on this dataset improves results by 23 percentage points on DafnyBench and 50 percentage points on DafnySynthesis, showing that synthetic verified code can strengthen the formal verification capabilities of LLMs. The work also provides a structured framework for evaluation and future exploration in spec-focused synthesis, automated proofs, and agentic verification tools.
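The summary does not define the soundness and completeness checks, so the following is only a minimal sketch under one common reading: a specification is sound with respect to a test case if it accepts the expected output, and complete if it rejects every other candidate output for that input. All names here (`max_spec`, `weak_spec`, `is_sound`, `is_complete`) are hypothetical, and the checks are executed at runtime in Python, whereas ATLAS states them as Dafny lemmas to be discharged by the verifier.

```python
from typing import Callable, Iterable

# Hypothetical representation: a spec is a postcondition over (input, output).
Spec = Callable[[list[int], int], bool]


def max_spec(xs: list[int], result: int) -> bool:
    """Illustrative postcondition for 'return the maximum of a non-empty list'."""
    return result in xs and all(x <= result for x in xs)


def weak_spec(xs: list[int], result: int) -> bool:
    """An under-constrained postcondition: any element of the list is accepted."""
    return result in xs


def is_sound(spec: Spec, xs: list[int], expected: int) -> bool:
    """Assumed reading of soundness: the spec accepts the test case's expected output."""
    return spec(xs, expected)


def is_complete(spec: Spec, xs: list[int], expected: int,
                candidates: Iterable[int]) -> bool:
    """Assumed reading of completeness: the spec rejects every other candidate output."""
    return all(not spec(xs, c) for c in candidates if c != expected)


if __name__ == "__main__":
    test_input, test_output = [3, 1, 4], 4
    wrong_outputs = range(-1, 6)  # small illustrative pool of alternative outputs
    for name, spec in [("max_spec", max_spec), ("weak_spec", weak_spec)]:
        print(name,
              "sound:", is_sound(spec, test_input, test_output),
              "complete:", is_complete(spec, test_input, test_output, wrong_outputs))
```

Under this reading, `weak_spec` passes the soundness check but fails completeness, because it also accepts 3 and 1 as outputs. Catching under-constrained specifications of this kind is the purpose of a completeness check.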

Abstract

Large language models have shown potential for program verification, but progress is hindered by the scarcity of verified code for training. We present ATLAS, an automated pipeline that synthesizes verified programs at scale to address this data bottleneck. ATLAS generates complete Dafny programs with specifications, implementations, and proofs, producing 2.7K verified programs from which we extract over 19K training examples (more than 7 per verified program) by decomposing the synthesis process into multiple specialized tasks. Fine-tuning Qwen 2.5 7B Coder on this dataset produces substantial gains: +23 percentage points on DafnyBench and +50 percentage points on DafnySynthesis. These results demonstrate that synthetic verified code can effectively enhance LLM capabilities for formal verification.

Paper Structure

This paper contains 26 sections, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: An overview of the ATLAS pipeline.
  • Figure 2: Evaluation of our fine-tuned Qwen 2.5 7B Coder on DafnyBench and DafnySynthesis, compared with results reported in Loughridge et al. (2024) and Misu et al. (2024) at the time of their release.
  • Figure 3: A verified Dafny program synthesized by ATLAS, including specification, implementation, and test cases.
  • Figure 4: Soundness and completeness lemmas constructed from the first test case in Figure 3.
  • Figure 5: ATLAS pipeline success rate by TACO-verified difficulty ratings and skill types.
  • ...and 2 more figures