OpenLLM-RTL: Open Dataset and Benchmark for LLM-Aided Design RTL Generation
Shang Liu, Yao Lu, Wenji Fang, Mengming Li, Zhiyao Xie
TL;DR
This work presents an open framework for LLM-assisted RTL design with three components. RTLLM 2.0 is a 50-design benchmark for RTL generation across diverse modules. AssertEval is an open-source benchmark for evaluating LLM-generated assertions for RTL verification using formal property verification (FPV). RTLCoder-Data is a large instruction-to-code RTL dataset (80K raw samples, 7K verified), with the verified samples obtained through a verification-based correctness check. Together, these resources support end-to-end development and fair evaluation of LLMs for RTL code generation and verification. Experiments show that increasing data size and improving data quality, especially via verification-based filtering, boost LLM performance. The work also discusses practical challenges such as design complexity, data leakage, and assertion quality, and highlights the potential of open datasets and benchmarks to democratize EDA research.
Abstract
The automated generation of design RTL based on large language models (LLMs) and natural language instructions has demonstrated great potential in agile circuit design. However, the lack of public datasets and benchmarks hinders the development and fair evaluation of LLM solutions. This paper highlights our latest advances in open datasets and benchmarks from three perspectives: (1) RTLLM 2.0, an updated benchmark assessing LLMs' capability in design RTL generation. The benchmark is expanded to 50 hand-crafted designs, each providing a design description, test cases, and correct RTL code. (2) AssertEval, an open-source benchmark assessing LLMs' assertion generation capabilities for RTL verification. The benchmark includes 18 designs, each providing a specification, signal definitions, and correct RTL code. (3) RTLCoder-Data, an extended open-source dataset with 80K instruction-code data samples. Moreover, we propose a new verification-based method to check the functional correctness of training data samples. Based on this technique, we further release a dataset with 7K verified high-quality samples. These three studies are integrated into one framework, providing off-the-shelf support for the development and evaluation of LLMs for RTL code generation and verification. Finally, extensive experiments indicate that LLM performance can be boosted by enlarging the training dataset, improving data quality, and improving the training scheme.
