Table of Contents
Fetching ...

FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware

Minwoo Kang, Mingjie Liu, Ghaith Bany Hamad, Syed Suhaib, Haoxing Ren

TL;DR

FVEval is presented, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV, and a wide range of existing LLMs are evaluated against FVEval, to investigate where today's LLMs stand and how they might further enable their application toward improving productivity in digital FV.

Abstract

The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation. In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV. The benchmark consists of three sub-tasks that measure LLM capabilities at different levels: from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without additional human input. As test instances, we present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with industrial FV workflows. A wide range of existing LLMs, both proprietary and open-source, are evaluated against FVEval, based on which we investigate where today's LLMs stand and how we might further enable their application toward improving productivity in digital FV. Our benchmark and evaluation code is available at \url{https://github.com/NVlabs/FVEval}.

FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware

TL;DR

FVEval is presented, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV, and a wide range of existing LLMs are evaluated against FVEval, to investigate where today's LLMs stand and how they might further enable their application toward improving productivity in digital FV.

Abstract

The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation. In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV. The benchmark consists of three sub-tasks that measure LLM capabilities at different levels: from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without additional human input. As test instances, we present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with industrial FV workflows. A wide range of existing LLMs, both proprietary and open-source, are evaluated against FVEval, based on which we investigate where today's LLMs stand and how we might further enable their application toward improving productivity in digital FV. Our benchmark and evaluation code is available at \url{https://github.com/NVlabs/FVEval}.

Paper Structure

This paper contains 32 sections, 28 figures, 6 tables.

Figures (28)

  • Figure 1: Overview of the FVEval benchmark and evaluation flow. FVEval consists of three sub-benchmarks NL2SVA-Human, NL2SVA-Machine, and Design2SVA that measure model capabilities in generating functionally correct implementations of assertions from NL descriptions and also directly from design under test. The evaluation framework integrates industry-standard FV tools to accurately measure LLM response correctness.
  • Figure 1: Evaluation results for the NL2SVA-Human Benchmark. Boldface and underlined results indicate highest and second highest metric values, respectively. We see that models that proprietary models with generally advanced capabilities for code generation and logical reasoning achieve the best performance across all metrics. Notably, all models show a significant gap between full and partial equivalence of assertion functionality---this highlights the value of fine-grained measurement of functional correctness that captures how close LLM-generated assertions are similar to the human-written reference solution.
  • Figure 2: (Left) Input and output examples of the NL2SVA-Human benchmark. Reference solution is the SVA assertion written by the human engineer that the LLM response is expected to match in terms of functionality. (Right) Distribution of token length for the NL specifications and the reference SVA solutions contained in the benchmark.
  • Figure 3: (Left) Input and output examples of the NL2SVA-Machine benchmark. The dataset consists of naturalized descriptions of the formal-symbolic logic formulae and the corresponding SVA assertion, based on which the NL annotations were created. (Right) Distribution of lengths of NL descriptions and SVA assertions, measured in token length using the Llama3 tokenizer.
  • Figure 4: Synthetic RTL Testcase generation. For each category of designs, we vary a list of control parameters to generate test design instances such that the total set of 96 cases for each category exhibit a wide distribution of difficulties. The total token length of the randomly generated arithmetic logic and FSM transition logic, correlated with complexity of designs under FV, are shown.
  • ...and 23 more figures