Assessing Large Language Models in Generating RTL Design Specifications

Hung-Ming Huang, Yu-Hsin Yang, Fu-Chieh Chang, Yun-Chia Hsu, Yin-Yu Lin, Ming-Fang Tsai, Chun-Chih Yang, Pei-Yuan Wu

TL;DR

The paper addresses the generation of human-readable specifications from RTL code, with the goal of improving design understanding and documentation. It introduces prompting strategies and two hardware-aware evaluation metrics, the GPT-RTL Score and the RTL-Reconstruction Score, to quantify specification fidelity, and it benchmarks open-source and commercial LLMs on VerilogEval-V2 and RTLLM-2.0. The results show that structured prompting, especially multi-step reasoning, improves specification quality, and that the LLM-based metrics correlate with reconstruction-based validation, which offers practical guidance for automated hardware documentation workflows. The work establishes a framework for systematic RTL-to-specification evaluation and highlights differences in capability across model scales and families.

Abstract

As IC design grows more complex, automating comprehension and documentation of RTL code has become increasingly important. Engineers currently must manually interpret existing RTL code and write specifications, a slow and error-prone process. Although LLMs have been studied for generating RTL from specifications, automated specification generation remains underexplored, largely due to the lack of reliable evaluation methods. To address this gap, we investigate how prompting strategies affect RTL-to-specification quality and introduce metrics for faithfully evaluating generated specs. We also benchmark open-source and commercial LLMs, providing a foundation for more automated and efficient specification workflows in IC design.

Paper Structure

This paper contains 27 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Minimal prompt (top) and specification-aware prompt (bottom).
  • Figure 2: Multi-step reasoning prompt.
  • Figure 3: Prompt for GPT Score.
  • Figure 4: Prompt for GPT-RTL Score.
  • Figure 5: RTL-Reconstruction (RR) score.
  • ...and 3 more figures