CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization

Yang Zhao, Di Huang, Chongxiao Li, Pengwei Jin, Muxin Song, Yinan Xu, Ziyuan Nan, Mingju Gao, Tianyun Ma, Lei Qi, Yansong Pan, Zhenxing Zhang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu

TL;DR

CodeV tackles the scarcity and quality problems in HDL training data by building a high-quality Verilog/Chisel dataset through multi-level summarization (MLS), deduplication, and syntax verification, then fine-tuning LLMs with a Chat-FIM-Tag approach to support chat and fill-in-middle (FIM) tasks across both languages. The approach yields open-source HDL generation models that achieve state-of-the-art results on VerilogEval and competitive performance on multi-language benchmarks, while extending coverage to Chisel through additional data and targeted fine-tuning. The work also introduces evaluation benchmarks for multilingual and multi-scenario HDL generation and demonstrates the value of MLS and language tags in low-data regimes, with plans to release the dataset and models to accelerate progress in HDL automation.

Abstract

The design flow of processors, particularly in hardware description languages (HDLs) like Verilog and Chisel, is complex and costly. While recent advances in large language models (LLMs) have significantly improved coding tasks in software languages such as Python, their application in HDL generation remains limited due to the scarcity of high-quality HDL data. Traditional methods of adapting LLMs for hardware design rely on synthetic HDL datasets, which often suffer from low quality because even advanced LLMs like GPT perform poorly in the HDL domain. Moreover, these methods focus solely on chat tasks and the Verilog language, limiting their application scenarios. In this paper, we observe that: (1) HDL code collected from the real world is of higher quality than code generated by LLMs. (2) LLMs like GPT-3.5 are better at summarizing HDL code than at generating it. (3) An explicit language tag can help LLMs better adapt to the target language when there is insufficient data. Based on these observations, we propose an efficient LLM fine-tuning pipeline for HDL generation that integrates a multi-level summarization data synthesis process with a novel Chat-FIM-Tag supervised fine-tuning method. The pipeline enhances the generation of HDL code from natural language descriptions and enables the handling of various tasks such as chat and infilling incomplete code. Utilizing this pipeline, we introduce CodeV, a series of HDL generation LLMs. Among them, CodeV-All not only covers a more diverse range of languages, i.e., Verilog and Chisel, and a broader scope of tasks, i.e., chat and fill-in-middle (FIM), but also achieves performance on VerilogEval that is comparable to or even surpasses that of CodeV-Verilog, which is fine-tuned on Verilog only. This makes the CodeV models the first series of open-source LLMs designed for multi-scenario HDL generation.
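
The way chat and FIM samples might both carry a language tag can be pictured with the sketch below. The abstract does not give the actual Chat-FIM-Tag format, so everything here is an assumption made for illustration: the sentinel strings `<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`, the bracketed tag such as `[Verilog]`, and the helper names `make_chat_sample` and `make_fim_sample` are hypothetical and are not taken from the paper.

```python
# Hypothetical sentinels; the real tokens depend on the base model's tokenizer.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def language_tag(lang: str) -> str:
    """Render an explicit language tag (assumed format: "[Verilog]")."""
    return f"[{lang}]"


def make_chat_sample(lang: str, description: str, code: str) -> dict:
    """Pair a tagged natural-language description with its HDL module."""
    return {"prompt": f"{language_tag(lang)} {description}", "completion": code}


def make_fim_sample(lang: str, code: str, start: int, end: int) -> dict:
    """Cut code[start:end] out as the middle the model must fill in."""
    if not 0 <= start <= end <= len(code):
        raise ValueError("start/end must satisfy 0 <= start <= end <= len(code)")
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    # Prefix-suffix-middle ordering: the model sees both sides, then
    # generates the missing span after the final sentinel.
    prompt = (
        f"{language_tag(lang)}{FIM_PREFIX}{prefix}"
        f"{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    )
    return {"prompt": prompt, "completion": middle}


if __name__ == "__main__":
    verilog = "module inv(input a, output y);\n  assign y = ~a;\nendmodule\n"
    start = verilog.index("assign")
    end = verilog.index("endmodule")
    print(make_chat_sample("Verilog", "Implement an inverter.", verilog)["prompt"])
    print(make_fim_sample("Verilog", verilog, start, end)["completion"])
```

In this sketch the same tag prefixes both task types, so a model trained on a mixture of the two sees the target language stated explicitly whether it is answering a description or infilling a gap.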

Paper Structure (31 sections, 4 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: The CodeV framework overview. We first collect and filter high-quality HDL modules from open-source codebases. The modules are then sent to GPT-3.5 to request multi-level summaries. The high-level descriptions are paired with their corresponding modules, and the resulting high-quality dataset is used to fine-tune base LLMs, yielding the CodeV models.
  • Figure 2: A GPT-generated Verilog dataset contains unrealistic code. (a) An unrealistic synthetic description obtained from RTLCoder (Liu et al., 2023). (b) The circuit's block diagram according to the description, which contains illegal loops. (c) The corresponding incorrect Verilog code in the dataset.
  • Figure 3: An actual example of the prompt for multi-level summarization. (a) The prompt provided to GPT-3.5. (b) An example of the demonstrations, with code, low-level descriptions, and high-level summaries. (c) The summary returned by GPT-3.5 with multi-level summarization and (d) without it.
  • Figure 4: A detailed example showing all prompts used for multi-level summarization on the Verilog dataset. (a) Demonstrations given to GPT-3.5. (b) A Verilog snippet from the CodeV data used to generate the corresponding instructions, which can be adjusted by modifying the Code Snippet. (c) The response from GPT-3.5, consisting of the Description and Problem sections.
  • Figure 5: The similarity distribution of our dataset against benchmark data. Data with Rouge-L scores greater than $0.5$ are removed for decontamination.
  • ...and 5 more figures
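
The decontamination rule in the Figure 5 caption (drop data whose Rouge-L score against benchmark data exceeds 0.5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it scores with the Rouge-L F-measure over whitespace-separated tokens, and the function names `lcs_length`, `rouge_l`, and `decontaminate` are hypothetical. The paper may use a different tokenizer or Rouge-L variant.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """Rouge-L F-measure on whitespace tokens (assumed variant)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def decontaminate(
    samples: List[str], benchmark: List[str], threshold: float = 0.5
) -> List[str]:
    """Keep only samples whose score stays at or below the threshold
    against every benchmark item."""
    return [
        s for s in samples
        if all(rouge_l(s, b) <= threshold for b in benchmark)
    ]


if __name__ == "__main__":
    benchmark = ["module adder ( input a , input b , output s ) ;"]
    samples = [
        "module adder ( input a , input b , output s ) ;",  # exact overlap
        "always @ ( posedge clk ) q <= d ;",                # unrelated code
    ]
    print(decontaminate(samples, benchmark))
```

Comparing every sample against every benchmark item is quadratic in the token lengths per pair, which is acceptable for a sketch; a large corpus would likely need batching or a cheaper pre-filter first.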