Exploring the Agentic Frontier of Verilog Code Generation

Patrick Yubeaton; Chinmay Hegde; Siddharth Garg

Abstract

Large language models (LLMs) have made rapid advances in code generation for popular languages such as Python and C++. Many of these recent gains can be attributed to the use of "agents" that wrap domain-relevant tools alongside LLMs. Hardware design languages such as Verilog have also seen improved code generation in recent years, but the impact of agentic frameworks on Verilog code generation tasks remains unclear. In this work, we present the first systematic evaluation of agentic LLMs for Verilog generation, using the recently introduced CVDP benchmark. We also introduce several open-source hardware design agent harnesses, providing a model-agnostic baseline for future work. Through controlled experiments across frontier models, we study how structured prompting and tool design affect performance, analyze agent failure modes and tool usage patterns, compare open-source and closed-source models, and provide qualitative examples of successful and failed agent runs. Our results show that naive agentic wrapping around frontier models can degrade performance (relative to standard forward passes with optimized prompts), but that structured harnesses match, and in some cases exceed, non-agentic baselines. We find that the performance gap between open-source and closed-source models is driven by both higher crash rates and weaker interpretation of tool output. Our exploration points the way toward designing special-purpose agents for Verilog generation.

Paper Structure (25 sections, 6 figures, 6 tables)

Introduction
Background & Related Work
Benchmarking LLM Verilog Generation
LLMs and Terminal Agents
Creating a Verilog Agent
Experiments
RQ1: Non-Agentic Performance of Frontier LLMs
RQ2: Does Agentic Tool Use Help?
RQ3: Can Structured Prompting and Expanded Tooling Improve the Agent?
RQ4: Agent Failure Modes and Tool Usage Patterns
Agent Completion and Crash Rates
Crash Predictiveness of Failure
Failure Mode Taxonomy
Tool Usage and Correctness
Open-Source vs. Closed-Source Models
...and 10 more sections

Figures (6)

Figure 1: Abbreviated agent trace for the binary-to-Gray task (Gemini 3.1 Pro, Mod 1). The agent correctly ignores testbench warnings and verifies its own RTL independently.
Figure 2: RTL produced by the agent for the binary-to-Gray task. A single combinational assignment correctly implements the Gray code conversion.
Figure 3: Abbreviated agent trace for the cellular automata task (Gemini 3.1 Pro, baseline). Large simulation output grows the context past the 1M token limit, causing a hard crash.
Figure 4: Baseline system prompt, adapted from the CVDP benchmark.
Figure 5: Updated structured system prompt with five-step verification loop.
...and 1 more figure
