Automatically Improving LLM-based Verilog Generation using EDA Tool Feedback

Jason Blocklove, Shailja Thakur, Benjamin Tan, Hammond Pearce, Siddharth Garg, Ramesh Karri

TL;DR

This work investigates using automated EDA-tool feedback to repair LLM-generated Verilog, introducing AutoChip as an open-source framework that iteratively evaluates candidate designs via HDL compilers and testbenches and feeds error-driven feedback back to the LLM over a tree search with parameters $k$ (candidates per prompt) and $d$ (tree depth). Evaluations on the VerilogEval benchmark show that tool feedback improves results for GPT-4o, achieving up to a 5.8% increase in passing designs with a 34.2% decrease in cost over the best zero-shot results; mixing smaller models with a final GPT-4o pass can reach similar success levels at substantially lower cost. The study highlights that feedback effectiveness is model-dependent, that increasing $k$ and $d$ generally helps, and that succinct feedback can rival full-context feedback while reducing token usage. The open-source AutoChip platform enables broader evaluation across more models and benchmarks, paving the way for automated, tool-guided hardware design workflows.

Abstract

Traditionally, digital hardware designs are written in the Verilog hardware description language (HDL) and debugged manually by engineers. This can be time-consuming and error-prone for complex designs. Large Language Models (LLMs) are emerging as a potential tool to help generate fully functioning HDL code, but most works have focused on single-shot generation (i.e., run and evaluate), a process that does not leverage debugging and, as such, does not adequately reflect a realistic development process. In this work, we evaluate the ability of LLMs to leverage feedback from electronic design automation (EDA) tools to fix mistakes in their own generated Verilog. To accomplish this, we present an open-source, highly customizable framework, AutoChip, which combines conversational LLMs with the output from Verilog compilers and simulations to iteratively generate and repair Verilog. To determine the success of these LLMs, we leverage the VerilogEval benchmark set. We evaluate four state-of-the-art conversational LLMs, focusing on readily accessible commercial models. EDA tool feedback proved consistently more effective than zero-shot prompting only for GPT-4o, the most computationally complex model we evaluated. In the best case, we observed a 5.8% increase in the number of successful designs with a 34.2% decrease in cost over the best zero-shot results. Mixing smaller models with this larger model at the end of the feedback iterations achieved as much success as GPT-4o with feedback, but incurred 41.9% lower cost (corresponding to an overall 89.6% decrease in cost relative to zero-shot).

Paper Structure

This paper contains 12 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: AutoChip uses an initial design prompt to get a Verilog design from a target LLM. Multiple ($k$) candidate responses can be generated per-prompt, which are then each evaluated and ranked using the feedback from HDL compilers and testbench simulations to identify mismatches compared to a reference design. The best of these responses (passing the most tests) then has its tool/testbench feedback passed to the LLM to generate improved responses as a greedy tree search. This is done up to a tree depth of $d$.
  • Figure 2: An example configuration file for AutoChip. "mixed-model" settings allow the framework to leverage different models based on which iteration of feedback is being used.
  • Figure 3: A subset of the output of running AutoChip to generate the rule110 VerilogEval-Human benchmark.
  • Figure 4: A subset of the generated output file structure from AutoChip generating the rule110 VerilogEval-Human benchmark.
  • Figure 5: System prompt/context for LLM interactions
  • ...and 9 more figures
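
The greedy tree search described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not AutoChip's actual code: the `Candidate` type, the `generate` and `evaluate` callables, the scoring convention (1.0 meaning every testbench check passes), the early exit on a fully passing design, and the way feedback is appended to the prompt are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    code: str       # Verilog source returned by the LLM
    score: float    # hypothetical rank: fraction of testbench checks passed (1.0 = all)
    feedback: str   # compiler/simulator output to hand back to the LLM


def greedy_tree_search(
    prompt: str,
    generate: Callable[[str, int], List[str]],  # hypothetical LLM call: (context, k) -> k responses
    evaluate: Callable[[str], Candidate],       # hypothetical tool run: compile + simulate + rank
    k: int,
    d: int,
) -> Optional[Candidate]:
    """Expand only the best-ranked candidate at each level, up to depth d."""
    if k < 1 or d < 1:
        raise ValueError("k and d must both be at least 1")

    best: Optional[Candidate] = None
    context = prompt
    for _ in range(d):
        # Generate k candidate responses for the current context and rank them.
        candidates = [evaluate(code) for code in generate(context, k)]
        if not candidates:
            break
        level_best = max(candidates, key=lambda c: c.score)

        # Assumption: remember the best design seen at any depth.
        if best is None or level_best.score > best.score:
            best = level_best

        # Assumption: stop early once a design passes every check.
        if level_best.score >= 1.0:
            break

        # Greedy step: only the top candidate's feedback seeds the next level.
        context = (
            f"{prompt}\n\nPrevious attempt:\n{level_best.code}"
            f"\n\nTool feedback:\n{level_best.feedback}"
        )
    return best
```

In this sketch each level costs at most $k$ LLM responses, so a full run issues at most $k \cdot d$ responses, which is why raising either parameter trades cost for a better chance of a passing design.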