Assessing Large Language Models in Generating RTL Design Specifications
Hung-Ming Huang, Yu-Hsin Yang, Fu-Chieh Chang, Yun-Chia Hsu, Yin-Yu Lin, Ming-Fang Tsai, Chun-Chih Yang, Pei-Yuan Wu
TL;DR
The paper tackles the challenge of generating human-readable specifications from RTL code to improve understanding and documentation. It introduces prompting strategies and two hardware-aware evaluation metrics, GPT-RTL Score and RTL-Reconstruction Score, to quantify specification fidelity, and it benchmarks open-source and commercial LLMs on VerilogEval-V2 and RTLLM-2.0. Results show that structured prompting, especially multi-step reasoning, improves specification quality, and that LLM-based metrics correlate with reconstruction-based validation, which offers practical guidance for automated hardware documentation workflows. The work establishes a framework for systematic RTL-to-specification evaluation and highlights differences in model capabilities across scales and families.
Abstract
As IC design grows more complex, automating the comprehension and documentation of RTL code has become increasingly important. Engineers must currently interpret existing RTL code manually and write specifications by hand, a slow and error-prone process. Although LLMs have been studied for generating RTL from specifications, the reverse task of automated specification generation remains underexplored, largely due to the lack of reliable evaluation methods. To address this gap, we investigate how prompting strategies affect RTL-to-specification quality and introduce metrics for faithfully evaluating generated specifications. We also benchmark open-source and commercial LLMs, providing a foundation for more automated and efficient specification workflows in IC design.
