VerilogEval: Evaluating Large Language Models for Verilog Code Generation

Mingjie Liu; Nathaniel Pinckney; Brucek Khailany; Haoxing Ren

TL;DR

VerilogEval is a benchmarking framework for evaluating Verilog code generation by large language models (LLMs), using 156 HDLBits problems and automated functional testing to check correctness. The framework distinguishes machine-generated from human-curated problem descriptions and demonstrates that supervised fine-tuning (SFT) on synthetic data can improve generation quality, especially for machine-generated descriptions. Key findings include the effectiveness of SFT with larger models, the importance of data quality, and trade-offs in the number of training epochs, with implications for scalable, domain-specific code generation in hardware design. The work highlights both the promise and current limits of using LLMs for boilerplate Verilog generation and points to richer problem modalities and alignment techniques as future directions.

Abstract

The increasing popularity of large language models (LLMs) has paved the way for their application in diverse domains. This paper proposes a benchmarking framework tailored specifically for evaluating LLM performance in the context of Verilog code generation for hardware design and verification. We present a comprehensive evaluation dataset consisting of 156 problems from the Verilog instructional website HDLBits. The evaluation set covers a diverse range of Verilog code generation tasks, from simple combinational circuits to complex finite state machines. The Verilog code completions can be automatically tested for functional correctness by comparing the transient simulation outputs of the generated design with those of a golden solution. We also demonstrate that the Verilog code generation capability of pretrained language models can be improved with supervised fine-tuning by bootstrapping with LLM-generated synthetic problem-code pairs.
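The output-comparison idea in the abstract can be pictured with a small sketch. It assumes the simulation outputs of both designs have already been sampled into one dictionary of output-signal values per time step; the names `Trace`, `count_mismatches`, and `is_functionally_correct` are hypothetical illustrations and are not taken from the paper or its tooling.

```python
from typing import Dict, List

# Hypothetical representation: one dict of output-signal values per sampled time step.
Trace = List[Dict[str, int]]


def count_mismatches(generated: Trace, golden: Trace) -> int:
    """Count sampled time steps where any output differs from the golden trace."""
    if len(generated) != len(golden):
        raise ValueError("traces must cover the same number of time steps")
    return sum(1 for gen, ref in zip(generated, golden) if gen != ref)


def is_functionally_correct(generated: Trace, golden: Trace) -> bool:
    """Treat a completion as passing only if every sampled output matches."""
    return count_mismatches(generated, golden) == 0


# Toy traces for a single-output design (illustrative values only).
golden_trace = [{"out": 0}, {"out": 1}, {"out": 1}]
matching_trace = [{"out": 0}, {"out": 1}, {"out": 1}]
faulty_trace = [{"out": 0}, {"out": 0}, {"out": 1}]

print(is_functionally_correct(matching_trace, golden_trace))
print(count_mismatches(faulty_trace, golden_trace))
```

In this sketch a single mismatching sample is enough to fail a completion, which is one plausible reading of "functional correctness"; the framework's actual testbench behavior may differ.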

Paper Structure (16 sections, 1 equation, 9 figures, 4 tables)

Introduction
Evaluation Framework
VerilogEval Evaluation Set
Problem Descriptions
VerilogEval-machine
VerilogEval-human
Automated Testing Environment
Evaluation Metric
Supervised Fine-Tuning
Synthetic SFT Data Generation
Results on Supervised Fine-tuning
Training Epochs
Model Size and Base Model
SFT Data Quality
Limitations and Future Directions
...and 1 more section

Figures (9)

Figure 1: VerilogEval uses a sandbox environment for simple and reproducible evaluation of LLM Verilog code generation
Figure 2: Example of vectorr in VerilogEval-human. The Problem Description includes both natural language description and module header, input, and output definition.
Figure 3: VerilogEval-machine uses gpt-3.5-turbo to generate problem descriptions for 2012_q2b.
Figure 4: ChatGPT guidance on state transition diagrams.
Figure 5: Examples of VerilogEval-human descriptions. We show original website descriptions alongside manually converted text format.
...and 4 more figures
