VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation

Prashanth Vijayaraghavan, Luyao Shi, Stefano Ambrogio, Charles Mackin, Apoorva Nitsure, David Beymer, Ehsan Degan

TL;DR

The paper tackles the evaluation of large language models for VHDL code generation, addressing the lack of HDL-specific benchmarks. It introduces the VHDL-Eval dataset of 202 problems, built by translating Verilog-Eval problems to VHDL and aggregating publicly available VHDL problems, together with self-verifying testbenches that enable systematic syntactic and functional evaluation. The study assesses multiple models under zero-shot generation, in-context learning (ICL), and parameter-efficient fine-tuning (PEFT) using the Pass@k metric. It reveals substantial challenges: ICL yields only modest gains, while adapter-based fine-tuning (QLoRA) yields notable ones. The results highlight the need for supervised VHDL-specific fine-tuning to bridge performance gaps and provide a practical framework for hardware designers seeking automated VHDL code generation assistance.
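The Pass@k metric named above can be illustrated with a short sketch. It assumes the unbiased estimator commonly used for code-generation benchmarks, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ pass the testbench; the summary does not spell out the paper's exact formula, so treat this as an assumption. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed standard unbiased estimator).

    n: number of generated samples for the problem
    c: number of those samples that pass the testbench
    k: number of samples the user is assumed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that at least one of k samples drawn without replacement passes.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs, one per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    example = [(10, 0), (10, 3), (10, 10)]
    print(mean_pass_at_k(example, k=1))
```

With k = 1 the estimate reduces to the fraction of passing samples per problem, which is why per-problem counts of samples and passes are all the harness needs to record.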

Abstract

With the unprecedented advancements in Large Language Models (LLMs), their application domains have expanded to include code generation tasks across various programming languages. While significant progress has been made in enhancing LLMs for popular programming languages, there exists a notable gap in comprehensive evaluation frameworks tailored for Hardware Description Languages (HDLs), particularly VHDL. This paper addresses this gap by introducing a comprehensive evaluation framework designed specifically for assessing LLM performance on the VHDL code generation task. We construct an evaluation dataset by translating a collection of Verilog evaluation problems to VHDL and aggregating publicly available VHDL problems, resulting in a total of 202 problems. To assess the functional correctness of the generated VHDL code, we utilize a curated set of self-verifying testbenches specifically designed for this aggregated VHDL problem set. We conduct an initial evaluation of different LLMs and their variants, including zero-shot code generation, in-context learning (ICL), and parameter-efficient fine-tuning (PEFT) methods. Our findings underscore the considerable challenges faced by existing LLMs in VHDL code generation, revealing significant scope for improvement. This study emphasizes the necessity of supervised fine-tuning of code generation models specifically for VHDL, offering potential benefits to VHDL designers seeking efficient code generation solutions.

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of our Evaluation Framework for VHDL Code Generation
  • Figure .1: Left: Verilog half-adder problem statement and its canonical solution from the Verilog-Eval dataset. Center: VHDL canonical solution for the half-adder, obtained by translating the Verilog code using the Icarus Verilog tool. Right: Section of the self-verifying VHDL testbench for the half-adder, including test cases within the VUnit testing framework.
  • Figure 2: Example reframing of a standard problem as prompt.