Table of Contents
Fetching ...

Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

Manvi Jha, Jiaxin Wan, Deming Chen

TL;DR

Proof2Silicon presents an end-to-end framework that unites PREFACE’s RL-driven prompt repair with a full hardware synthesis pipeline. By iteratively steering a frozen LLM to generate verified Dafny code and converting it through Python and PyLog into HLS C and RTL, the approach demonstrates formally verifiable hardware synthesis from natural language specifications. Key contributions include verifier-guided prompt refinement, an integrated verification loop, and an automated Dafny-to-Vivado HLS flow, achieving up to 72% end-to-end synthesis success on a 100-task benchmark. This work bridges formal verification and hardware design automation, offering a scalable path from high-level specifications to silicon realizations with quantifiable verification and synthesis outcomes.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE's verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny's Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE's RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.

Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

TL;DR

Proof2Silicon presents an end-to-end framework that unites PREFACE’s RL-driven prompt repair with a full hardware synthesis pipeline. By iteratively steering a frozen LLM to generate verified Dafny code and converting it through Python and PyLog into HLS C and RTL, the approach demonstrates formally verifiable hardware synthesis from natural language specifications. Key contributions include verifier-guided prompt refinement, an integrated verification loop, and an automated Dafny-to-Vivado HLS flow, achieving up to 72% end-to-end synthesis success on a 100-task benchmark. This work bridges formal verification and hardware design automation, offering a scalable path from high-level specifications to silicon realizations with quantifiable verification and synthesis outcomes.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE's verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny's Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE's RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.

Paper Structure

This paper contains 25 sections, 1 equation, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the PREFACE 10.1145/3716368.3735300 framework-forming the prompt‑optimization core embedded within the Proof2Silicon pipeline.
  • Figure 2: Overview of the proposed Proof2Silicon framework.
  • Figure 3: Proof2Silicon end‑to‑end workflow: an RL‑driven Small Language Model (SLM) iteratively refines meta‑prompts to a frozen LLM under guidance from the integrated Dafny verifier, producing formally verified Dafny code. Verified code is then compiled to Python, transformed into hardware‑aware C via the PyLog backend, and synthesized to RTL with Vivado HLS on a Zynq‑7000 SoC.
  • Figure 4: Training curves for the PREFACE pipeline over more than 2000 episodes: (a) Value Loss $\times$ 1000 and Policy Loss, we scale up the Value loss for a better visualization; (b) Episode Reward, illustrates the pipeline's verification performance improving over time.