MAGE: A Multi-Agent Engine for Automated RTL Code Generation

Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, Jishen Zhao

TL;DR

This work tackles the challenge of reliably generating synthesizable Verilog RTL code with large language models by introducing MAGE, an open-source multi-agent system that partitions RTL design into specialized roles for code generation, testbench creation, judging, and debugging. It augments this architecture with a high-temperature candidate sampling mechanism and a Verilog-state checkpointing feedback scheme to explore diverse RTL candidates and provide precise, actionable debugging feedback. Empirical results on VerilogEval benchmarks show MAGE achieving up to 95.7% Pass@1, outperforming both vanilla LLMs and prior RTL design systems, with ablations highlighting the advantages of task partitioning and principled debugging. Overall, MAGE offers a robust, open framework that can significantly streamline AI-assisted RTL design workflows and improve functional correctness in automated hardware generation.

Abstract

The automatic generation of RTL code (e.g., Verilog) through natural language instructions has emerged as a promising direction with the advancement of large language models (LLMs). However, producing RTL code that is both syntactically and functionally correct remains a significant challenge. Existing single-LLM-agent approaches face substantial limitations because they must navigate between various programming languages and handle intricate generation, verification, and modification tasks. To address these challenges, this paper introduces MAGE, the first open-source multi-agent AI system designed for robust and accurate Verilog RTL code generation. We propose a novel high-temperature RTL candidate sampling and debugging system that effectively explores the space of code candidates and significantly improves the quality of the candidates. Furthermore, we design a novel Verilog-state checkpoint checking mechanism that enables early detection of functional errors and delivers precise feedback for targeted fixes, significantly enhancing the functional correctness of the generated RTL code. MAGE generates syntactically and functionally correct code for 95.7% of problems on the VerilogEval-Human 2 benchmark, surpassing the state-of-the-art Claude-3.5-sonnet by 23.3%, demonstrating a robust and reliable approach for AI-driven RTL design workflows.
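
The checkpoint idea described in the abstract (detect a functional error early and report exactly where it occurs) can be illustrated with a toy sketch. This is not MAGE's implementation: the trace representation (one dictionary of signal values per clock cycle) and the function name `first_mismatch` are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: one dict of signal name -> value per clock cycle.
Trace = List[Dict[str, int]]


def first_mismatch(
    expected: Trace, actual: Trace
) -> Optional[Tuple[int, str, int, Optional[int]]]:
    """Return (cycle, signal, expected_value, actual_value) for the earliest
    disagreement, or None if every checked signal matches in every cycle."""
    for cycle, (exp, act) in enumerate(zip(expected, actual)):
        for signal in sorted(exp):  # sorted for a deterministic report
            if act.get(signal) != exp[signal]:
                return cycle, signal, exp[signal], act.get(signal)
    return None


# Hypothetical traces: the candidate design diverges in cycle 2.
expected = [{"q": 0}, {"q": 1}, {"q": 0}]
actual = [{"q": 0}, {"q": 1}, {"q": 1}]
print(first_mismatch(expected, actual))  # (2, 'q', 0, 1)
```

Reporting the earliest divergent cycle and signal, rather than a bare pass/fail verdict, is what gives a debugging step something precise to act on.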

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) The overview of MAGE, (b) the roles of four types of agents, (c) state check module, and (d) sampling and debugging module.
  • Figure 2: Normalized mismatch count of generated testbenches at different stages under varying temperature configurations (Low temperature: $T=0$, $n=1$; High temperature: $T=0.85$, $n=20$), using the Claude 3.5 Sonnet model (dated 2024-10-22) across two benchmarks: VerilogEval-v1-Human (Liu et al., 2023) and VerilogEval-v2 (Pinckney et al., 2024). Problems that directly passed before Step and those with zero mean mismatches in both configurations are not shown in the figure. The blue violin plot shows that candidates (blue dots) generated with high-temperature sampling typically have lower mean mismatch counts across most problems compared to those generated with low temperatures (purple dots).
  • Figure 3: Case study of the RTL code state checkpoint on Prob093-ece241-2014-q3.
  • Figure 4: Score $S(r)$ improvement of RTL by sampling and debugging. (a) Score distribution: generated RTL without sampling versus sampled and selected best RTL candidate; (b) Score distribution and the mean score of generated RTL in each debug round. Data of problems fixed before entering the debug stage are not included.
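
The sample-then-select pattern behind Figures 2 and 4 can be sketched in a few lines. The definition of $S(r)$ is not given in this section, so the sketch assumes a simple stand-in score: the fraction of testbench checks whose output matches. The names `score`, `sample_and_select`, `toy_generate`, and `toy_simulate` are hypothetical, and a seeded random generator stands in for high-temperature LLM sampling.

```python
import random
from typing import Callable, List, Sequence, Tuple


def score(expected: Sequence[int], actual: Sequence[int]) -> float:
    """Assumed score: fraction of checks whose output matches the expectation."""
    if not expected:
        raise ValueError("expected outputs must be non-empty")
    matches = sum(e == a for e, a in zip(expected, actual))
    return matches / len(expected)


def sample_and_select(
    generate: Callable[[random.Random], List[int]],
    simulate: Callable[[List[int]], Sequence[int]],
    expected: Sequence[int],
    n: int,
    seed: int = 0,
) -> Tuple[float, List[int]]:
    """Draw n candidates and return (best_score, best_candidate)."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scored = [(score(expected, simulate(c)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


# Toy stand-ins: a "candidate" is a 4-entry truth table, the target is AND.
EXPECTED = [0, 0, 0, 1]


def toy_generate(rng: random.Random) -> List[int]:
    return [rng.randint(0, 1) for _ in range(4)]


def toy_simulate(candidate: List[int]) -> Sequence[int]:
    return candidate  # the truth table is its own simulated output


single_score, _ = sample_and_select(toy_generate, toy_simulate, EXPECTED, n=1)
best_score, _ = sample_and_select(toy_generate, toy_simulate, EXPECTED, n=20)
print(single_score, best_score)
```

With the same seed, the 20-sample pool contains the single-sample candidate, so the selected score can never be lower than the single-sample score. That is the basic argument for drawing many diverse candidates before debugging.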