Evaluating LLMs for Hardware Design and Test

Jason Blocklove; Siddharth Garg; Ramesh Karri; Hammond Pearce

TL;DR

The paper investigates whether contemporary LLMs can simultaneously design Verilog hardware modules from natural language specifications and generate verification testbenches, using eight real-world benchmarks and a Tiny Tapeout 3 silicon tapeout on Skywater 130nm. It implements a prompt-driven, iterative design–test loop with a structured feedback taxonomy (TF, SHF, MHF, AHF) and evaluates multiple conversational LLMs. The main finding is that ChatGPT-4 can produce functional designs and partial testbenches, but testbench generation remains fragile and often requires substantial human guidance, while some models fail to meet initial specifications or produce noncompliant passes; post-silicon validation confirms functional behavior for the successful cases. These results suggest that LLMs can aid, but not yet fully automate, hardware design and verification, highlighting a path toward improved models and tooling for end-to-end design pipelines in practical semiconductor workflows.

Abstract

Large Language Models (LLMs) have demonstrated capabilities for producing code in Hardware Description Languages (HDLs). However, most of the focus remains on their abilities to write functional code, not test code. The hardware design process consists of both design and test, and so eschewing validation and verification leaves considerable potential benefit unexplored, given that a design and test framework may allow for progress towards full automation of the digital design pipeline. In this work, we perform one of the first studies exploring how an LLM can both design and test hardware modules from provided specifications. Using a suite of 8 representative benchmarks, we examined the capabilities and limitations of state-of-the-art conversational LLMs when producing Verilog for functional and verification purposes. We taped out the benchmarks on a Skywater 130nm shuttle and received the functional chip.

Paper Structure (15 sections, 10 figures, 3 tables)

Introduction
Background and Related Work
Large Language Models (LLMs)
LLM Aided Design
Prompting LLMs for Design and Test
Methodology
Real-world design constraints on benchmark design
Challenge benchmarks
Model evaluation: Metrics
Example conversation
Results
Simulation Results
Silicon Results
Evaluation
Conclusion

Figures (10)

Figure 1: Simplified LLM conversation flowchart
Figure 1: 8-bit shift register attempt from ChatGPT-3.5.
Figure 2: Design prompt with 8-bit shift register example. Lines 2-8 would be updated depending upon the desired spec.
Figure 2: 8-bit shift register attempt by Bard. Input on line 4 is too wide.
Figure 3: Testbench prompt. This prompt remains constant.
...and 5 more figures
