DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis
Zeju Li, Changran Xu, Zhengyuan Shi, Zedong Peng, Yi Liu, Yunhao Zhou, Lingfeng Zhou, Chengyu Ma, Jianyuan Zhong, Xi Wang, Jieru Zhao, Zhufei Chu, Xiaoyan Yang, Qiang Xu
TL;DR
DeepCircuitX tackles the lack of repository-scale, multimodal RTL datasets by delivering a four-level dataset (repository, file, module, block) enriched with Chain-of-Thought (CoT) annotations and integrated PPA data. The approach combines large-scale data collection from GitHub, structured CoT annotations via GPT-4 and Claude, and circuit synthesis to produce netlists, SDFs, and PPA metrics, enabling end-to-end evaluation of RTL-focused LLMs. Fine-tuning multiple LLMs on this resource yields strong gains for RTL understanding, completion, and generation. PPA prediction benefits from repository-level context, although early-stage timing remains challenging. By enabling pre-training, evaluation benchmarks, and multimodal RTL analysis, DeepCircuitX provides a practical foundation for advancing AI-assisted hardware design automation.
Abstract
This paper introduces DeepCircuitX, a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. This structure enables more nuanced training and evaluation of large language models (LLMs) for RTL-specific tasks. DeepCircuitX is enriched with Chain-of-Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels. These annotations enhance its utility for a wide range of tasks, including RTL code understanding, generation, and completion. Additionally, the dataset includes synthesized netlists and PPA metrics, facilitating early-stage design exploration and enabling accurate PPA prediction directly from RTL code. We demonstrate the dataset's effectiveness on various LLMs fine-tuned with it and confirm its quality with human evaluations. Our results highlight DeepCircuitX as a critical resource for advancing RTL-focused machine learning applications in hardware design automation. Our data is available at https://zeju.gitbook.io/lcm-team.
