Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation

Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, Brucek Khailany

TL;DR

The paper analyzes progress in large-language models applied to hardware code generation by revisiting VerilogEval and introducing VerilogEval v2, which adds specification-to-RTL support, in-context learning (ICL), failure classification, and an infrastructure overhaul. It evaluates a broad set of models, including GPT-4o, GPT-4 Turbo, Llama3.1 variants, RTL-Coder, and DeepSeek, across code completion and specification-to-RTL tasks, revealing that open models can match or approach closed-model performance, with prompt tuning playing a crucial role. Key findings show GPT-4o achieving leading performance in 1-shot settings, Llama3.1 405B closing the gap as an open model, and RTL-specific models offering competitive results at smaller sizes, while the effect of ICL examples is highly model- and task-dependent. The enhanced benchmark and failure-classification framework provide granular insights into error modes, enabling targeted prompt engineering and laying groundwork for future agent-based, multi-turn hardware design workflows. Overall, the work underscores the importance of task-aligned evaluation, scalable infrastructure, and richer feedback mechanisms to advance LLM-assisted hardware design toward practical deployment.

Abstract

The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate new commercial and open models since VerilogEval's original release, including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder, against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval's infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Pass rate across recent large-language models similar to VerilogEval v1 for pass@1. Green models are closed general-purpose models, orange are open general-purpose models, dark blue are coding-specific models, and light blue is an RTL-specific model.
  • Figure 2: Overview of VerilogEval v2 flow.
  • Figure 3: Pass rate across recent large-language models. Green models are closed general-purpose models, orange are open general-purpose models, dark blue are coding-specific models, and light blue is an RTL-specific model. Purple is the older Llama3 70B to demonstrate a large degradation due to ICLs.
  • Figure 4: Pass rate of three models for code completion and specification-to-RTL tasks, with 0-shot to 3-shot in-context learning examples. Solid lines are code completion and dashed lines are spec-to-RTL.
  • Figure 5: Failure classification for Llama2 70B, Llama3 70B, and Llama3.1 70B models with 0-shot to 3-shot ICL examples across the two tasks. Orange coloring indicates compiler errors, while blue indicates runtime issues.
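
The Figure 1 caption reports pass@1, but this summary does not reproduce the paper's equation. As a rough illustration, the sketch below uses the unbiased pass@k estimator commonly used for code-generation benchmarks: for a problem with n generated samples of which c pass the testbench, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. That the paper uses exactly this form is an assumption here, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical, not part of VerilogEval.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed
    k: evaluation budget (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # in which case any draw of k samples contains a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative (made-up) counts: two problems, 10 samples each.
    demo = [(10, 3), (10, 7)]
    print(mean_pass_at_k(demo, k=1))
```

For k = 1 the estimator reduces to c / n, so pass@1 is simply the fraction of samples that pass, averaged across problems.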