Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation
Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, Brucek Khailany
TL;DR
The paper analyzes progress in large-language models applied to hardware code generation by revisiting VerilogEval and introducing VerilogEval v2, which adds specification-to-RTL support, in-context learning (ICL), failure classification, and an infrastructure overhaul. It evaluates a broad set of models, including GPT-4o, GPT-4 Turbo, Llama3.1 variants, RTL-Coder, and DeepSeek, across code completion and specification-to-RTL tasks, and finds that open models can match or approach closed-model performance, with prompt tuning playing a crucial role. GPT-4o leads in 1-shot settings, the open Llama3.1 405B comes close to it, and RTL-specific models are competitive at much smaller sizes, while the effect of ICL depends heavily on model and task. The enhanced benchmark and failure-classification framework expose error modes in detail, which supports targeted prompt engineering and lays groundwork for future agent-based, multi-turn hardware design workflows. Overall, the work underscores the importance of task-aligned evaluation, scalable infrastructure, and richer feedback mechanisms for moving LLM-assisted hardware design toward practical deployment.
Abstract
The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate new commercial and open models that have appeared since VerilogEval's original release (including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder) against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval's infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.
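The abstract does not define how a pass rate is computed. As a minimal sketch, assuming it follows the widely used unbiased pass@k estimator for code-generation benchmarks (an assumption here, not something the abstract states), the metric for one problem with n sampled completions, of which c pass the testbench, could be computed as below. The names `pass_at_k` and `benchmark_pass_rate` are hypothetical and are not taken from the VerilogEval code base.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed the testbench
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_rate(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark-level figure such as the pass rates quoted above would be the mean of the per-problem estimates, for example `benchmark_pass_rate([(20, 5), (20, 20)], k=1)` for a two-problem suite.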
