MAGE: A Multi-Agent Engine for Automated RTL Code Generation
Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, Jishen Zhao
TL;DR
This work tackles the challenge of reliably generating synthesizable Verilog RTL code with large language models by introducing MAGE, an open-source multi-agent system that partitions RTL design into specialized roles for code generation, testbench creation, judging, and debugging. It augments this architecture with a high-temperature candidate sampling mechanism and a Verilog-state checkpointing feedback scheme to explore diverse RTL candidates and provide precise, actionable debugging feedback. Empirical results on VerilogEval benchmarks show MAGE achieving up to 95.7% Pass@1, outperforming both vanilla LLMs and prior RTL design systems, with ablations highlighting the advantages of task partitioning and principled debugging. Overall, MAGE offers a robust, open framework that can significantly streamline AI-assisted RTL design workflows and improve functional correctness in automated hardware generation.
Abstract
The automatic generation of RTL code (e.g., Verilog) from natural language instructions has emerged as a promising direction with the advancement of large language models (LLMs). However, producing RTL code that is both syntactically and functionally correct remains a significant challenge. Existing single-LLM-agent approaches face substantial limitations because a single agent must switch between several programming languages while handling intricate generation, verification, and modification tasks. To address these challenges, this paper introduces MAGE, the first open-source multi-agent AI system designed for robust and accurate Verilog RTL code generation. We propose a novel high-temperature RTL candidate sampling and debugging system that effectively explores the space of code candidates and improves their quality. Furthermore, we design a novel Verilog-state checkpoint checking mechanism that enables early detection of functional errors and delivers precise feedback for targeted fixes, significantly enhancing the functional correctness of the generated RTL code. MAGE generates syntactically and functionally correct code for 95.7% of the problems in the VerilogEval-Human 2 benchmark, surpassing the state-of-the-art Claude-3.5-sonnet by 23.3%, demonstrating a robust and reliable approach for AI-driven RTL design workflows.
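To make the idea of checkpoint-style feedback concrete, the following is a minimal sketch and not MAGE's actual implementation. It assumes that simulation output is available as a per-cycle list of signal values; the `Trace` format and the helper names `first_mismatch` and `format_feedback` are hypothetical. The sketch reports the earliest cycle at which a candidate design diverges from the expected behavior, which is the kind of localized message a debugging agent could act on.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical trace format: one dict of signal name -> value per clock cycle.
Trace = List[Dict[str, int]]
Mismatch = Tuple[int, str, int, Optional[int]]


def first_mismatch(expected: Trace, actual: Trace) -> Optional[Mismatch]:
    """Return (cycle, signal, expected_value, actual_value) for the earliest
    divergence, or None if the traces agree on every compared cycle.

    zip() stops at the shorter trace, so only the common prefix is compared.
    """
    for cycle, (exp, act) in enumerate(zip(expected, actual)):
        for signal in sorted(exp):  # sorted for a deterministic report
            if act.get(signal) != exp[signal]:
                return cycle, signal, exp[signal], act.get(signal)
    return None


def format_feedback(mismatch: Optional[Mismatch]) -> str:
    """Turn a mismatch into a short, targeted message for a debugging step."""
    if mismatch is None:
        return "All checkpoints match."
    cycle, signal, want, got = mismatch
    return f"First mismatch at cycle {cycle}: {signal} expected {want}, got {got}."


if __name__ == "__main__":
    expected = [{"q": 0}, {"q": 1}, {"q": 2}]
    actual = [{"q": 0}, {"q": 1}, {"q": 3}]
    print(format_feedback(first_mismatch(expected, actual)))
```

Stopping at the first divergent cycle, instead of reporting only a pass or fail verdict for the whole simulation, is what keeps the feedback small and specific in this sketch.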
