Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation
Kaiyan Chang, Zhirong Chen, Yunhao Zhou, Wenlong Zhu, Kun Wang, Haobo Xu, Cangyuan Li, Mengdi Wang, Shengwen Liang, Huawei Li, Yinhe Han, Ying Wang
TL;DR
This paper shows that natural language alone is insufficient for Verilog generation for spatially complex hardware, and proposes an open-source multi-modal benchmark plus a Verilog Large Model Query Language (VLMQL) for vision-language co-design. The benchmark framework features hierarchical difficulty, multi-level prompting, and fine-grained token metrics for evaluating multi-modal models, and experiments with GPT-4V and LLaMA variants show significant improvements in syntax and functional correctness over NL-only baselines. The work provides practical tooling and datasets for hardware design with large multi-modal models, and suggests that visual context can substantially reduce misalignment and improve design fidelity. By standardizing evaluation and supporting efficient multi-modal generation workflows, it aims to enable a more scalable and diversified approach to hardware design.
Abstract
Natural language interfaces have shown considerable potential for automating Verilog generation from high-level specifications using large language models, and have attracted significant attention. However, this paper shows that for hardware architectures with spatial complexity, visual representations provide additional context critical to design intent and can be more effective than natural-language-only inputs. Building on this observation, we introduce an open-source benchmark for multi-modal generative models targeting Verilog synthesis from visual-linguistic inputs, covering both singular and complex modules. We also introduce an open-source visual and natural language Verilog query language framework to facilitate efficient and user-friendly multi-modal queries. To evaluate multi-modal generative AI on Verilog generation tasks, we compare it with a popular method that relies solely on natural language. Our results demonstrate a significant accuracy improvement for Verilog generated from multi-modal queries over Verilog generated from natural-language-only queries. We hope this work reveals a new approach for the era of large hardware design models, fostering more diversified and productive hardware design.
