MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation

Yongan Zhang; Zhongzhi Yu; Yonggan Fu; Cheng Wan; Yingyan Celine Lin

TL;DR

The paper addresses the lack of large, detailed public hardware datasets for training LLMs in Verilog generation. It proposes MG-Verilog, a multi-grained dataset with varying levels of description detail and corresponding code samples, plus open-source infrastructure. A balanced fine-tuning scheme is introduced to leverage diverse detail levels and improve generalization. Experiments show MG-Verilog-tuned models achieve higher Verilog-generation accuracy and robustness than baselines, signaling practical gains for hardware design workflows.

Abstract

Large Language Models (LLMs) have recently shown promise in streamlining hardware design processes by encapsulating vast amounts of domain-specific data. In addition, they allow users to interact with the design processes through natural language instructions, thus making hardware design more accessible to developers. However, effectively leveraging LLMs in hardware design necessitates providing domain-specific data during inference (e.g., through in-context learning), fine-tuning, or pre-training. Unfortunately, existing publicly available hardware datasets are often limited in size, complexity, or detail, which hinders the effectiveness of LLMs in hardware design tasks. To address this issue, we first propose a set of criteria for creating high-quality hardware datasets that can effectively enhance LLM-assisted hardware design. Based on these criteria, we propose a Multi-Grained-Verilog (MG-Verilog) dataset, which encompasses descriptions at various levels of detail and corresponding code samples. To benefit the broader hardware design community, we have developed an open-source infrastructure that facilitates easy access, integration, and extension of the dataset to meet specific project needs. Furthermore, to fully exploit the potential of the MG-Verilog dataset, which varies in complexity and detail, we introduce a balanced fine-tuning scheme. This scheme serves as a unique use case to leverage the diverse levels of detail provided by the dataset. Extensive experiments demonstrate that the proposed dataset and fine-tuning scheme consistently improve the performance of LLMs in hardware design tasks.
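As a rough illustration of what "multi-grained" and "balanced" could mean in practice, the sketch below pairs each Verilog module with descriptions at several levels of detail and then draws the same number of training pairs from each level. This is only one plausible reading of the abstract, not the authors' implementation: the record layout, the level names (`coarse`, `detailed`), the toy modules, and the `balanced_pairs` helper are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical records: one Verilog module paired with descriptions at
# several levels of detail. Level names are illustrative placeholders,
# not the dataset's actual schema.
SAMPLES = [
    {
        "code": "module and2(input a, input b, output y); assign y = a & b; endmodule",
        "descriptions": {
            "coarse": "A 2-input AND gate.",
            "detailed": "Module and2 takes inputs a and b and drives y with a & b.",
        },
    },
    {
        "code": "module inv(input a, output y); assign y = ~a; endmodule",
        "descriptions": {
            "coarse": "An inverter.",
            "detailed": "Module inv takes input a and drives y with its bitwise negation.",
        },
    },
]


def balanced_pairs(samples, per_level, seed=0):
    """Return (level, description, code) triples, with per_level triples per level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for sample in samples:
        for level, text in sample["descriptions"].items():
            by_level[level].append((level, text, sample["code"]))

    pairs = []
    for level in sorted(by_level):
        pool = by_level[level]
        # Sample with replacement so a small pool can still fill its quota.
        pairs.extend(rng.choice(pool) for _ in range(per_level))
    rng.shuffle(pairs)
    return pairs


pairs = balanced_pairs(SAMPLES, per_level=3)
```

Each triple could then be formatted as an instruction/response pair for fine-tuning. Equal per-level quotas are an assumption made here for simplicity; the paper may weight the levels differently.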

Paper Structure (18 sections, 3 figures, 1 table)

Introduction
Criteria for Datasets in LLM-assisted Hardware Design
The Proposed MG-Verilog Dataset
Dataset Overview
Dataset Construction
Data Collection and Preprocessing
Description Generation
Multi-grained Dataset Structure
Detailed Statistics of the Dataset
Dataset Access and Extension Instructions
Dataset Unique Use Case: A Balanced Fine-tuning Scheme
Experimental Results
Experiment Setup
Ablation Study on Different Evaluation Settings
Ablations on the Number of Training Samples
...and 3 more sections

Figures (3)

Figure 1: Illustrating the proposed MG-Verilog dataset structure and examples of varying levels of detail.
Figure 2: The detailed statistics of the MG-Verilog dataset, using the tokenizer from the GPT-3.5-Turbo model.
Figure 3: Pass rates of the RTL code generated by the fine-tuned CodeLLaMA-7B-Instruct model for different numbers of training samples. Only the detailed global summaries of the code are used during fine-tuning here.
