Evaluating LLMs for Hardware Design and Test

Jason Blocklove, Siddharth Garg, Ramesh Karri, Hammond Pearce

TL;DR

The paper investigates whether contemporary LLMs can both design Verilog hardware modules from natural-language specifications and generate the testbenches that verify them, using eight real-world benchmarks and a Tiny Tapeout 3 silicon tapeout on Skywater 130nm. It implements a prompt-driven, iterative design–test loop with a structured feedback taxonomy (TF, SHF, MHF, AHF) and evaluates multiple conversational LLMs. The main finding is that ChatGPT-4 can produce functional designs and partial testbenches, but testbench generation remains fragile and often requires substantial human guidance. Some of the other models fail to meet the initial specifications or produce noncompliant passes. Post-silicon validation confirms functional behavior for the successful cases. These results suggest that LLMs can aid, but not yet fully automate, hardware design and verification, and they point toward the model and tooling improvements needed for end-to-end design pipelines in practical semiconductor workflows.
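The iterative design–test loop with escalating feedback can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the acronym expansions (tool, simple human, moderate human, advanced human feedback), the escalation order, the `tries_per_level` limit, and the stub functions `fake_llm` and `fake_tools` are all assumptions introduced here for the example.

```python
from enum import IntEnum
from typing import Callable, Optional, Tuple


class Feedback(IntEnum):
    # Expansions are assumptions; the summary gives only the acronyms.
    TF = 0   # tool feedback: raw compiler/simulator error text
    SHF = 1  # simple human feedback
    MHF = 2  # moderate human feedback
    AHF = 3  # advanced human feedback


def design_test_loop(
    ask_llm: Callable[[Optional[Feedback], Optional[str]], str],
    run_tools: Callable[[str], Optional[str]],
    tries_per_level: int = 3,  # hypothetical escalation threshold
) -> Tuple[bool, Optional[Feedback], int]:
    """Return (passed, highest feedback level given, number of LLM responses)."""
    code = ask_llm(None, None)  # response to the initial design/testbench prompt
    responses = 1
    highest: Optional[Feedback] = None
    for level in Feedback:  # escalate TF -> SHF -> MHF -> AHF
        for _ in range(tries_per_level):
            error = run_tools(code)  # compile and simulate design + testbench
            if error is None:
                return True, highest, responses
            code = ask_llm(level, error)  # feed the error back at this level
            highest = level
            responses += 1
    # Check the last response before giving up.
    return run_tools(code) is None, highest, responses


# Hypothetical stand-ins for a chat model and an HDL toolchain.
def fake_llm(level: Optional[Feedback], error: Optional[str]) -> str:
    # This stub only "fixes" the design once a human hint is given.
    return "good" if level is not None and level >= Feedback.SHF else "bad"


def fake_tools(code: str) -> Optional[str]:
    return None if code == "good" else "error: output mismatch"


if __name__ == "__main__":
    print(design_test_loop(fake_llm, fake_tools))
```

Recording the highest feedback level a conversation needed gives a simple way to compare how much human guidance each benchmark and model required.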

Abstract

Large Language Models (LLMs) have demonstrated capabilities for producing code in Hardware Description Languages (HDLs). However, most of the focus remains on their abilities to write functional code, not test code. The hardware design process consists of both design and test, and so eschewing validation and verification leaves considerable potential benefit unexplored, given that a design and test framework may allow for progress towards full automation of the digital design pipeline. In this work, we perform one of the first studies exploring how an LLM can both design and test hardware modules from provided specifications. Using a suite of 8 representative benchmarks, we examined the capabilities and limitations of state-of-the-art conversational LLMs when producing Verilog for functional and verification purposes. We taped out the benchmarks on a Skywater 130nm shuttle and received the functional chip.

Paper Structure (15 sections, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Simplified LLM conversation flowchart
  • Figure 1: 8-bit shift register attempt from ChatGPT-3.5.
  • Figure 2: Design prompt with 8-bit shift register example. Lines 2-8 would be updated depending upon the desired spec.
  • Figure 2: 8-bit shift register attempt by Bard. Input on line 4 is too wide.
  • Figure 3: Testbench prompt. This prompt remains constant.
  • ...and 5 more figures