AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation

Vaishnavi Pulavarthi, Deeksha Nandal, Soham Dan, Debjit Pal

TL;DR

AssertionBench addresses the problem of scaling and evaluating LLM-based assertion generation for hardware verification. It provides a quantitative benchmark with 100 OpenCores Verilog designs and formally verified ground-truth assertions from GoldMine and HARM, enabling 1-shot and 5-shot in-context evaluation using four SOTA LLMs and a JasperGold verification pipeline. The results show GPT-4o offers the most consistent gains in valid assertions, but no model achieves high correctness across test designs, highlighting substantial room for improvement and the need for robust benchmarking. This benchmark offers a practical, standardized way to compare current and future LLMs for industrial-scale assertion generation and guides model design and data selection for verification tasks.

Abstract

Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., the detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. A considerable body of research has leveraged a blend of data-driven statistical analysis and static analysis to generate high-quality assertions from hardware design source code and design execution trace data. Despite such concerted effort, prior research struggles to scale to industrial-scale large designs, generates too many low-quality assertions, often fails to capture subtle and non-trivial design functionality, and does not produce easy-to-comprehend explanations of the generated assertions that would clarify their suitability for different downstream validation tasks. Recently, with the advent of Large-Language Models (LLMs), there has been a widespread effort to leverage prompt engineering to generate assertions. However, there has been little effort to quantitatively establish the effectiveness and suitability of various LLMs for assertion generation. In this paper, we present AssertionBench, a novel benchmark to quantitatively evaluate LLMs' effectiveness for assertion generation. AssertionBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design generated from GoldMine and HARM. We use AssertionBench to compare state-of-the-art LLMs and assess their effectiveness in inferring functionally correct assertions for hardware designs. Our experiments demonstrate how LLMs perform relative to each other, the benefits of using more in-context exemplars to generate a higher fraction of functionally correct assertions, and the significant room for improvement for LLM-based assertion generators.

Paper Structure

This paper contains 6 sections and 3 figures.

Figures (3)

  • Figure 1: Our evaluation framework. JG: JasperGold Formal Property Verification Engine.
  • Figure 2: An example of the prompt for 1-shot learning. The example consists of a tuple: a Verilog design (Program 1) and a set of formally verified assertions for the design (Assertions 1). The Test Program is the Verilog design for which the LLM generates assertions via in-context learning.
  • Figure 3: Comparison of accuracy of generated assertions. (a) Assertion accuracy comparison for GPT-3.5. (b) Assertion accuracy comparison for GPT-4o. (c) Assertion accuracy comparison for CodeLLaMa 2. (d) Assertion accuracy comparison for LLaMa3-70B. (e) $k = 1$-shot assertion accuracy. (f) $k = 5$-shot assertion accuracy. CEX: Counterexample trace.
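To make the prompt layout of Figure 2 concrete, the sketch below assembles a $k$-shot prompt from (design, assertions) exemplar tuples followed by a test design. This is a minimal illustration only: the exemplar Verilog module, the embedded SystemVerilog assertion, the helper name `build_k_shot_prompt`, and the instruction wording are all hypothetical and not taken from AssertionBench's actual prompt templates.

```python
# Illustrative sketch of assembling a k-shot prompt in the style of Figure 2.
# Exemplar content and instruction wording are hypothetical, not AssertionBench's.

EXEMPLAR = (
    # A toy Verilog design standing in for a curated OpenCores exemplar.
    "module toggle(input clk, input rst, output reg q);\n"
    "  always @(posedge clk) q <= rst ? 1'b0 : ~q;\n"
    "endmodule",
    # A formally verified SystemVerilog assertion for that design.
    "assert property (@(posedge clk) rst |=> (q == 1'b0));",
)

def build_k_shot_prompt(exemplars, test_design):
    """Concatenate k (design, assertions) tuples, then append the test design."""
    parts = []
    for i, (design, assertions) in enumerate(exemplars, start=1):
        parts.append(f"Program {i}:\n{design}\n\nAssertions {i}:\n{assertions}\n")
    parts.append(
        f"Test Program:\n{test_design}\n\n"
        "Generate SystemVerilog assertions for the Test Program."
    )
    return "\n".join(parts)

# k = 1: a single exemplar tuple precedes the test design.
prompt = build_k_shot_prompt([EXEMPLAR], "module dut(input clk); endmodule")
print(prompt)
```

For the 5-shot setting, the same helper would simply be called with five exemplar tuples; each generated assertion set is then checked against the design with a formal property verifier (JasperGold in the paper's pipeline).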