
QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

TL;DR

CodeV-R1 addresses Verilog generation from natural-language specifications by combining testbench-driven verification with a two-stage training pipeline. It introduces automated testbench generation, round-trip NL-code data synthesis, and distill-then-RL training, where the RL stage uses adaptive DAPO, an RLVR algorithm that adjusts its sampling rate to reduce training cost. The resulting CodeV-R1-7B achieves 68.6% pass@1 on VerilogEval v2 and 72.9% pass@1 on RTLLM v1.1, surpassing prior Verilog-domain models and exceeding the 671B DeepSeek-R1 on RTLLM. The model, training code, and dataset are released to support research on LLMs for EDA and HDL generation.
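
The pass@1 metric is typically estimated from several sampled completions per problem. The following is a minimal sketch assuming the widely used unbiased pass@k estimator; this summary does not state which estimator the authors use, and the function names `pass_at_k` and `mean_pass_at_k` are illustrative only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: how many of them passed, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems.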

Abstract

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.
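
The round-trip filtering idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: hardware modules are modelled as plain Python functions, random stimuli stand in for the rule-based testbench, and the names `random_stimuli`, `equivalent`, and `round_trip_filter` are hypothetical. Note that random testing can only find counterexamples; it does not prove equivalence.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Assumption: a combinational "module" is a function from input values
# to one output value. A real flow would run a Verilog simulator instead.
Module = Callable[[Tuple[int, ...]], int]


def random_stimuli(widths: Sequence[int], count: int, seed: int = 0):
    """Generate `count` random input vectors; widths[i] is the bit width of input i."""
    rng = random.Random(seed)
    return [tuple(rng.getrandbits(w) for w in widths) for _ in range(count)]


def equivalent(golden: Module, candidate: Module,
               widths: Sequence[int], count: int = 256) -> bool:
    """Return False as soon as any stimulus makes the two outputs differ."""
    return all(golden(v) == candidate(v) for v in random_stimuli(widths, count))


def round_trip_filter(snippets: List[Module],
                      describe: Callable[[Module], str],
                      regenerate: Callable[[str], Module],
                      widths: Sequence[int]) -> List[Tuple[str, Module]]:
    """Keep (description, original code) pairs that survive code -> NL -> code."""
    kept = []
    for code in snippets:
        nl = describe(code)            # code -> NL (an LLM in the paper)
        rebuilt = regenerate(nl)       # NL -> code (an LLM in the paper)
        if equivalent(code, rebuilt, widths):
            kept.append((nl, code))
    return kept


# Toy usage: the "LLM" regenerates the XOR correctly but botches the adder.
def adder(v):
    """8-bit adder with wrap-around"""
    return (v[0] + v[1]) & 0xFF


def xor_gate(v):
    """8-bit bitwise XOR"""
    return v[0] ^ v[1]


def buggy_adder(v):
    return (v[0] + v[1]) & 0x7F  # drops bit 7 of the sum


fake_llm = {adder.__doc__: buggy_adder, xor_gate.__doc__: xor_gate}
kept = round_trip_filter([adder, xor_gate], lambda m: m.__doc__,
                         lambda nl: fake_llm[nl], widths=(8, 8))
print([nl for nl, _ in kept])
```

In this toy run the adder pair is discarded because its regenerated version disagrees with the original on some stimulus, which mirrors how inequivalent examples are filtered out of the dataset.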

Paper Structure

This paper contains 35 sections, 1 theorem, 7 equations, 8 figures, 10 tables, 1 algorithm.

Key Result

Theorem 2.1

Consider the probabilistic models $M_1: \mathcal{F} \to \mathcal{L}$ (code-to-NL) and $M_2: \mathcal{L} \to \mathcal{F}$ (NL-to-code) from the NL-Code Deterministic Equivalence (NLCDE) definition (Definition 2.1). Let $Y \in \mathcal{F}$ be a random code snippet drawn from some distribution, a

Figures (8)

  • Figure 1: Overview of CodeV-R1. The core components of the framework are an automated testbench (see the equivalence-checking section), a supervised fine-tuning process (see the distillation section), and a reinforcement learning process (see the reinforcement-learning section).
  • Figure 2: Test-time scaling on RTLLM v1.1. Panel (a) shows accuracy against response length; panel (b) shows accuracy against FLOPs. FLOPs are estimated from the model architecture.
  • Figure 3: Train-time scaling of key metrics. Panel (a) tracks response length; panel (b) shows the corresponding trend for reward.
  • Figure 4: Time comparison between adaptive DAPO and baseline DAPO. (a) RL training time per step. (b) Acceleration ratio of adaptive DAPO over baseline DAPO, broken down by whether the step falls before step 150.
  • Figure 5: t-SNE distribution of the CodeV-R1 RL dataset, RTLLM (v2), and VerilogEval (v2 spec-to-RTL). Left: problem (NL) distribution. Right: solution (code) distribution.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 2.1: NL-Code Deterministic Equivalence (NLCDE)
  • Theorem 2.1: Semantic Equivalence in Round-Trip Transformations
  • Proof: Proof Sketch
  • Proof