VRank: Enhancing Verilog Code Generation from Large Language Models via Self-Consistency

Zhuorui Zhao, Ruidi Qiu, Ing-Chao Lin, Grace Li Zhang, Bing Li, Ulf Schlichtmann

TL;DR

This work addresses the quality gap in Verilog code generation by LLMs through VRank, an automatic framework that samples multiple code candidates, clusters them by identical simulated outputs on an LLM-generated testbench, and ranks clusters by consistency. A Chain-of-Thought step then resolves remaining inconsistencies to select the best candidate, enabling fully automated generation without human testbenches. Empirical evaluation on VerilogEval-Human across multiple models shows an average Pass@1 improvement of 10.5%, with notable gains for weaker baselines and robust performance across sample sizes as small as 5. The approach leverages self-consistency (in simulation outputs and reasoning) and MBR-inspired scoring to effectively identify functionally correct Verilog modules, offering a practical, scalable solution for automated hardware design tasks.

Abstract

Large Language Models (LLMs) have demonstrated promising capabilities in generating Verilog code from module specifications. To improve the quality of the generated Verilog code, previous methods require either time-consuming manual inspection or the generation of multiple Verilog candidates, from which the highest-quality one is selected using manually designed testbenches. To improve generation efficiency while maintaining code quality, we propose VRank, an automatic framework that generates Verilog code with LLMs. In our framework, multiple code candidates are generated by leveraging the probabilistic nature of LLMs. We then group the candidates into clusters based on identical outputs when simulated against the same testbench, which is also generated by LLMs. Clusters are ranked by the consistency they show on the testbench. Chain-of-Thought reasoning is then applied to select the best candidate from the top-ranked clusters. By systematically analyzing the diverse outputs of the generated code, VRank reduces errors and enhances the overall quality of the generated Verilog code. Experimental results on the VerilogEval-Human benchmark show a 10.5% average increase in functional correctness (pass@1) across multiple LLMs, demonstrating VRank's effectiveness in improving the accuracy of automated hardware description language generation for complex design tasks.
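The clustering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each candidate has already been simulated against the shared testbench and that its outputs are available as a tuple of strings. The function name `cluster_by_output`, the candidate names, and the traces are hypothetical. Ranking clusters purely by size is a simplifying assumption standing in for the consistency-based ranking the paper describes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_by_output(outputs: Dict[str, Tuple[str, ...]]) -> List[List[str]]:
    """Group candidates whose simulated output traces are identical.

    `outputs` maps a (hypothetical) candidate name to the trace it produced
    on the shared testbench. Returns clusters sorted largest first.
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for name, trace in outputs.items():
        groups[trace].append(name)
    # Assumption: cluster size is used as a stand-in for the consistency score.
    # sorted() is stable, so equally sized clusters keep first-seen order.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical traces for five sampled candidates (illustrative data only).
simulated = {
    "cand_0": ("0", "1", "1", "0"),
    "cand_1": ("0", "1", "1", "0"),
    "cand_2": ("0", "0", "1", "0"),
    "cand_3": ("0", "1", "1", "0"),
    "cand_4": ("1", "1", "1", "1"),
}

ranked = cluster_by_output(simulated)
print(ranked[0])  # ['cand_0', 'cand_1', 'cand_3']
```

In this toy run, three of the five candidates agree on every output, so they form the top-ranked cluster. In the framework described above, a Chain-of-Thought step would then pick the final candidate from the top-ranked clusters.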

Paper Structure

This paper contains 18 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison between (a) direct sampling, (b) debugging, and (c) VRank (proposed). Our method involves no human in the loop while achieving a significant improvement in Pass@1 accuracy.
  • Figure 2: The outline of VRank. Our framework contains three major steps: (a) execution-based clustering, (b) cluster ranking, and (c) CoT decision.
  • Figure 3: Generating Verilog code candidates from an LLM
  • Figure 4: Generating reference signals for a given scenario from an LLM
  • Figure 5: Functional correctness increases as the number of samples increases
  • ...and 1 more figure