Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

Manvi Jha; Jiaxin Wan; Deming Chen

Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

Manvi Jha, Jiaxin Wan, Deming Chen

TL;DR

Proof2Silicon presents an end-to-end framework that unites PREFACE’s RL-driven prompt repair with a full hardware synthesis pipeline. By iteratively steering a frozen LLM to generate verified Dafny code and converting it through Python and PyLog into HLS C and RTL, the approach demonstrates formally verifiable hardware synthesis from natural language specifications. Key contributions include verifier-guided prompt refinement, an integrated verification loop, and an automated Dafny-to-Vivado HLS flow, achieving up to 72% end-to-end synthesis success on a 100-task benchmark. This work bridges formal verification and hardware design automation, offering a scalable path from high-level specifications to silicon realizations with quantifiable verification and synthesis outcomes.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE's verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny's Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE's RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.

Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

TL;DR

Abstract

Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (4)