Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation

Nathaniel Pinckney; Christopher Batten; Mingjie Liu; Haoxing Ren; Brucek Khailany

TL;DR

The paper analyzes progress in large-language models applied to hardware code generation by revisiting VerilogEval and introducing VerilogEval v2, which adds specification-to-RTL support, in-context learning, failure classification, and an infrastructure overhaul. It evaluates a broad set of models, including GPT-4o, GPT-4 Turbo, Llama3.1 variants, RTL-Coder, and DeepSeek, across code completion and specification-to-RTL tasks, revealing that open models can match or approach closed-model performance, with prompt tuning playing a crucial role. Key findings show GPT-4o achieving leading performance in 1-shot settings, Llama3.1 405B closing gaps as an open model, and RTL-specific models offering competitive results at smaller sizes, while ICL effects are highly model- and task-dependent. The enhanced benchmark and failure-classification framework provide granular insights into error modes, enabling targeted prompt engineering and laying groundwork for future agent-based, multi-turn hardware design workflows. Overall, the work underscores the importance of task-aligned evaluation, scalable infrastructure, and richer feedback mechanisms to advance LLM-assisted hardware design toward practical deployment.

Abstract

The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate new commercial and open models released since VerilogEval's original release (including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder) against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval's infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.
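The pass rates quoted above are pass@k-style metrics (Figure 1 reports pass@1). As a minimal sketch, assuming the standard unbiased estimator used by code-generation benchmarks, pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c pass, the per-problem and benchmark-level numbers could be computed as follows. The function names `pass_at_k` and `benchmark_pass_rate` are illustrative and are not taken from the VerilogEval code base.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that passed, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_rate(results, k: int = 1) -> float:
    """Mean pass@k over problems; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: two problems, 20 samples each.
    print(benchmark_pass_rate([(20, 5), (20, 15)], k=1))
```

For k = 1 the estimator reduces to c / n, so a benchmark-level pass@1 is simply the mean fraction of passing samples per problem.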

Paper Structure (16 sections, 1 equation, 5 figures, 2 tables)

Introduction
VerilogEval v1 Revisited
VerilogEval v2 Improvements
Specification-to-RTL Task Support
Support for In-Context Learning Examples
Support for Failure Classification
Other Infrastructural Improvements
VerilogEval v2 Evaluation
Impact of ICL on Pass Rates and Failures
Increased In-Context Learning Examples
Case Study: Problem 9 and Problem 34
Aggregate Failure Analysis
Future Work
Agent-based Code Generation
Related Hardware Design Tasks
...and 1 more section

Figures (5)

Figure 1: Pass rate across recent large-language models similar to VerilogEval v1 for pass@1. Green models are closed general-purpose models, orange are open general-purpose models, dark blue are coding-specific models, and light blue is an RTL-specific model.
Figure 2: Overview of VerilogEval v2 flow.
Figure 3: Pass rate across recent large-language models. Green models are closed general-purpose models, orange are open general-purpose models, dark blue are coding-specific models, and light blue is an RTL-specific model. Purple is the older Llama3 70B to demonstrate a large degradation due to ICLs.
Figure 4: Pass rate of three models for code completion and specification-to-RTL tasks, with 0-shot to 3-shot in-context learning examples. Solid lines are code completion and dashed lines are spec-to-RTL.
Figure 5: Failure classification for Llama2 70B, Llama3 70B, and Llama3.1 70B models with 0-shot to 3-shot ICL examples across the two tasks. Orange coloring indicates compiler errors, while blue indicates runtime issues.
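Figure 5 groups failures into compiler errors and runtime issues. The sketch below shows one way such a classification could be automated from tool logs. It is an illustration under stated assumptions: the rule table, the category and kind labels, and the log strings it matches are all hypothetical and are not the rules used by VerilogEval v2.

```python
import re
from collections import Counter

# Hypothetical rule table: (category, kind, pattern).
# These patterns are illustrative assumptions, not the benchmark's own rules.
RULES = [
    ("compiler", "syntax_error", re.compile(r"syntax error", re.I)),
    ("compiler", "undeclared_identifier", re.compile(r"unable to bind|undeclared", re.I)),
    ("runtime", "timeout", re.compile(r"timeout", re.I)),
    ("runtime", "output_mismatch", re.compile(r"mismatch", re.I)),
]


def classify_failure(log: str) -> tuple:
    """Return (category, kind) for the first rule whose pattern matches."""
    for category, kind, pattern in RULES:
        if pattern.search(log):
            return (category, kind)
    return ("unknown", "unclassified")


def summarize(logs) -> Counter:
    """Count failures per top-level category across many logs."""
    return Counter(classify_failure(log)[0] for log in logs)


if __name__ == "__main__":
    sample_logs = [
        "top.sv:3: syntax error",
        "Mismatches: 12 in 100 samples",
        "simulation ended with TIMEOUT",
    ]
    print(summarize(sample_logs))
```

Ordering the rules from compile-time to run-time means a log containing both kinds of message is attributed to the earliest failing stage, which is usually the one worth fixing first.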
