ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness

Ce Guo, Tong Zhao

TL;DR

ResBench addresses the lack of resource-aware evaluation for LLM-generated HDL. It introduces a resource-centric FPGA benchmark of 56 problems in 12 domains, together with an automated framework that generates Verilog, validates functionality, and measures resource usage, primarily LUTs. The study compares nine LLMs spanning general-purpose, code-specialized, and HDL-specialized families and finds substantial variation in how well they optimize resources, identifying models that produce more hardware-efficient designs. The key contributions are the benchmark, an open-source evaluation pipeline, and an empirical study showing that resource awareness can distinguish LLMs beyond functional correctness. For AI-assisted FPGA design, the work can guide model development toward hardware-aware HDL generation and enables repeatable, scalable resource benchmarking.

Abstract

Field-Programmable Gate Arrays (FPGAs) are widely used in modern hardware design, yet writing Hardware Description Language (HDL) code for FPGA implementation remains a complex and time-consuming task. Large Language Models (LLMs) have emerged as a promising tool for HDL generation, but existing benchmarks for LLM-based code generation primarily focus on functional correctness while overlooking hardware resource usage. Furthermore, current benchmarks offer limited diversity and do not fully represent the wide range of real-world FPGA applications. To address these shortcomings, we introduce ResBench, the first resource-focused benchmark explicitly designed to distinguish between resource-optimized and inefficient LLM-generated HDL code. ResBench consists of 56 problems across 12 categories, covering applications from finite state machines to financial computing. Our open-source evaluation framework automatically tests LLMs by generating Verilog code, verifying correctness, and measuring resource usage. The experiments, which primarily analyze Lookup Table (LUT) usage, reveal significant differences among LLMs, demonstrating ResBench's capability to identify models that generate more resource-optimized FPGA designs.
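The three stages named in the abstract (generate Verilog, verify correctness, measure resource usage) can be sketched as a small Python harness. This is an illustrative sketch only, not the ResBench implementation: the names `Evaluation`, `evaluate`, and `summarize` are hypothetical, and the three callables are stand-ins for an LLM call, a testbench simulation, and an FPGA synthesis run, none of which are specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Evaluation:
    """Outcome for one benchmark problem (hypothetical record type)."""
    problem: str
    passed: bool
    luts: Optional[int]  # None when the design failed verification


def evaluate(
    problem: str,
    generate: Callable[[str], str],    # problem prompt -> Verilog source (stand-in for an LLM)
    verify: Callable[[str], bool],     # Verilog -> passes testbench? (stand-in for simulation)
    synthesize: Callable[[str], int],  # Verilog -> LUT count (stand-in for FPGA synthesis)
) -> Evaluation:
    """Run one problem through generation, verification, and resource measurement."""
    verilog = generate(problem)
    if not verify(verilog):
        # Incorrect designs are reported as failures and never synthesized.
        return Evaluation(problem, passed=False, luts=None)
    return Evaluation(problem, passed=True, luts=synthesize(verilog))


def summarize(results: List[Evaluation]) -> dict:
    """Aggregate the pass count and total LUT usage over passing designs."""
    passed = [r for r in results if r.passed]
    return {
        "passed": len(passed),
        "total": len(results),
        "total_luts": sum(r.luts for r in passed),
    }


if __name__ == "__main__":
    # Toy stubs: the "LLM" looks up canned text, "verification" checks for a
    # module keyword, and "synthesis" returns a made-up number.
    canned = {"adder": "module adder(); endmodule", "fsm": ""}
    results = [
        evaluate(
            name,
            generate=lambda p: canned[p],
            verify=lambda v: v.startswith("module"),
            synthesize=lambda v: 4,
        )
        for name in canned
    ]
    print(summarize(results))
```

Passing the stages in as callables keeps the sketch independent of any particular model API, simulator, or synthesis tool; a real harness would replace the stubs with calls to those tools.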

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Benchmark example illustrating HDL optimization capability using the expression $(a+b)^2 - (a-b)^2$. (a) Qwen-2.5 computes the full expression directly, leading to high LUT usage. (b) GPT-4 simplifies the expression to $4ab$, significantly reducing resource usage by using a single DSP unit instead of LUTs. This example demonstrates ResBench's ability to differentiate LLMs based on resource optimization.
  • Figure 2: Overview of the software workflow. The process begins with Verilog generation using an LLM, followed by functional verification through testbenches. Functionally correct designs undergo FPGA synthesis to extract resource usage metrics, and the framework compiles performance reports comparing functional correctness and resource usage.
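The simplification in Figure 1 follows from expanding both squares: $(a+b)^2 = a^2 + 2ab + b^2$ and $(a-b)^2 = a^2 - 2ab + b^2$, so their difference is $4ab$. The short Python check below confirms that the two forms agree on every pair of unsigned inputs of a given width. It is a sketch under stated assumptions: the function names are hypothetical, and it uses Python's unbounded integers, so it does not model Verilog bit widths, truncation, or how a synthesis tool maps either form to LUTs or DSP units.

```python
def direct(a: int, b: int) -> int:
    """Evaluate the expression as written: two squarings and a subtraction."""
    return (a + b) ** 2 - (a - b) ** 2


def simplified(a: int, b: int) -> int:
    """Evaluate the reduced form 4ab: one multiplication and a left shift by 2."""
    return (a * b) << 2


def forms_agree(width: int = 8) -> bool:
    """Exhaustively compare both forms over all unsigned inputs of `width` bits."""
    limit = 1 << width
    return all(
        direct(a, b) == simplified(a, b)
        for a in range(limit)
        for b in range(limit)
    )


if __name__ == "__main__":
    print(forms_agree(8))
```

The reduced form needs a single multiplier, which is consistent with the caption's observation that the simplified design can map to one DSP unit.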