Insights from Verification: Training a Verilog Generation LLM with Reinforcement Learning with Testbench Feedback

Ning Wang, Bingkun Yao, Jie Zhou, Yuchen Hu, Xi Wang, Nan Guan, Zhe Jiang

TL;DR

This work tackles the core problem of functional correctness in Verilog generation by introducing verification-driven training for Verilog-generation LLMs. It presents VeriPrefer, a two-stage framework that combines supervised fine-tuning on realistic specifications with reinforcement learning on preference data derived from automatically generated testbenches, using direct preference optimization (DPO) to learn from pairwise code-quality signals. The testbenches come from a decomposition-based pipeline (Analyze, Draft, Improve, Rectify) that uses Verilog compiler simulator (VCS) feedback to reduce hallucinated or incorrect testbenches. Sampled code that passes more test cases is labeled as preferred, which aligns training with functional behavior rather than token likelihood alone. Experiments on VerilogEval-Machine, VerilogEval-Human, RTLLM v1.1, RTLLM v2, and VerilogEval v2 show consistent gains over state-of-the-art baselines, and the improvements generalize across model families and sizes. The work releases all code, data, and models, offering a practical path toward deployable, functionally correct HDL generation and a strong baseline for verification-guided LLM training in hardware design.
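
The four-stage testbench pipeline can be pictured as a short control loop. The sketch below is a hypothetical illustration, not the authors' implementation: `llm` (prompt to text) and `simulate` (testbench plus reference code to a pass flag and log) are placeholder callables standing in for a model client and a VCS run, the prompts are invented, and running the testbench against the paired reference code during Rectify is an assumption about where VCS feedback enters.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces (not from the paper): an LLM client and a
# simulator wrapper that returns (passed, simulator_log).
LLM = Callable[[str], str]
Simulator = Callable[[str, str], Tuple[bool, str]]


@dataclass
class TestbenchResult:
    testbench: str
    passed: bool          # did the final testbench pass on the reference code?
    rectify_rounds: int   # how many simulator-guided repair rounds were used


def generate_testbench(spec: str, reference_code: str, llm: LLM,
                       simulate: Simulator, max_rectify: int = 3) -> TestbenchResult:
    # Analyze: ask the model which behaviors of the spec need testing.
    analysis = llm(f"Analyze this specification and list test scenarios:\n{spec}")
    # Draft: write an initial testbench from that analysis.
    tb = llm(f"Draft a Verilog testbench for these scenarios:\n{analysis}")
    # Improve: ask the model to extend the draft's coverage.
    tb = llm(f"Improve the coverage of this testbench:\n{tb}")
    # Rectify (assumed): simulate against the reference design and feed the
    # log back until the testbench passes or the repair budget is spent.
    ok, log = simulate(tb, reference_code)
    rounds = 0
    while not ok and rounds < max_rectify:
        tb = llm(f"Fix this testbench using the simulator log:\n{log}\n{tb}")
        ok, log = simulate(tb, reference_code)
        rounds += 1
    return TestbenchResult(tb, ok, rounds)
```

Injecting `llm` and `simulate` as callables keeps the control flow testable with stubs; a testbench that never passes within `max_rectify` rounds would presumably be discarded rather than used for scoring.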

Abstract

Large language models (LLMs) have shown strong performance in generating Verilog from natural language descriptions. However, ensuring the functional correctness of the generated code remains a significant challenge. This paper introduces a method that integrates verification insights from testbenches into the training of Verilog generation LLMs, aligning training with the fundamental goal of hardware design: functional correctness. The main obstacle to bringing verification into the training of Verilog generation LLMs is the lack of sufficient functional verification data, particularly testbenches paired with design specifications and code. To address this problem, we introduce an automatic testbench generation pipeline that decomposes the process into steps and uses feedback from the Verilog compiler simulator (VCS) to reduce hallucination and ensure correctness. We then use these testbenches to evaluate code generated by the model and collect the results for further training, which is how verification insights enter the training process. Our method applies reinforcement learning (RL), specifically direct preference optimization (DPO), to align Verilog code generation with functional correctness by training on preference pairs constructed from testbench outcomes. In evaluations on VerilogEval-Machine, VerilogEval-Human, RTLLM v1.1, RTLLM v2, and VerilogEval v2, our approach consistently outperforms state-of-the-art baselines in generating functionally correct Verilog code. We open source all training code, data, and models at https://anonymous.4open.science/r/VeriPrefer-E88B.
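
To make "preference pairs constructed from testbench outcomes" concrete, here is a minimal sketch. It assumes each sampled candidate is scored by the number of test cases it passes, that every pair with unequal scores becomes a (preferred, less preferred) pair, and that ties are dropped; the paper may select pairs differently. The loss is the standard per-pair DPO objective computed from summed sequence log-probabilities under the policy and a frozen reference model; `beta = 0.1` is an illustrative default, not a value taken from the paper.

```python
import math
from itertools import combinations
from typing import List, Tuple


def build_preference_pairs(samples: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Pair candidates by test cases passed (assumed pairing scheme).

    `samples` holds (code, passed_test_cases). Returns (preferred,
    less_preferred) tuples; equal pass counts carry no preference and are skipped.
    """
    pairs = []
    for (code_a, pass_a), (code_b, pass_b) in combinations(samples, 2):
        if pass_a > pass_b:
            pairs.append((code_a, code_b))
        elif pass_b > pass_a:
            pairs.append((code_b, code_a))
    return pairs


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair of summed sequence log-probabilities.

    loss = -log sigmoid(margin) = softplus(-margin), where
    margin = beta * [(pi_w - ref_w) - (pi_l - ref_l)].
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return _softplus(-margin)
```

The loss falls below log 2 once the policy raises the preferred sample's likelihood relative to the reference more than it raises the less-preferred one's. Writing it as a softplus avoids overflow when the margin is large and negative.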

Paper Structure

This paper contains 32 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of our work. We first use paired design specifications and Verilog code to automatically generate testbenches. We then prompt the fine-tuned model and test the generated code with these testbenches to collect verification insights. The code that passes more test cases is treated as preferred and the other as less preferred. Finally, the design specifications and the preference pairs are used for reinforcement learning.
  • Figure 2: Automatic testbench generation pipeline.
  • Figure 3: Line coverage report example. The red box indicates the total line coverage percentage. A 1/1 before a line means the line is covered by the testbench; 0/1 means it is not.
  • Figure 4: Simulation output example.
  • Figure 5: An example of generated testbench.
  • ...and 2 more figures
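
The per-line marks described for Figure 3 add up to the total line coverage percentage. The sketch below assumes a simplified text layout in which each coverable source line starts with a "covered/coverable" mark such as "1/1" or "0/1" and unmarked lines are not coverable. Real VCS coverage reports may be laid out differently, so the regular expression is a hypothetical stand-in.

```python
import re
from typing import Iterable

# Assumed annotation: "<covered>/<coverable>" at the start of a source line,
# e.g. "1/1  assign y = a & b;". Lines without a mark are not coverable.
_MARK = re.compile(r"^\s*(\d+)/(\d+)\s")


def line_coverage(report_lines: Iterable[str]) -> float:
    """Return line coverage as a percentage (0.0 if nothing is coverable)."""
    covered = coverable = 0
    for line in report_lines:
        m = _MARK.match(line)
        if m:
            covered += int(m.group(1))
            coverable += int(m.group(2))
    return 100.0 * covered / coverable if coverable else 0.0
```

With two covered lines out of three coverable ones, the function reports roughly 66.7%.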