David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?

Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury, Sukanta Bhattacharjee, Chandan Karfa

TL;DR

This paper tackles the high energy and cost burden of LLM-driven hardware design by evaluating small language models (SLMs) within a tailored agentic AI framework on the CVDP benchmark. It introduces a five-agent pipeline (Planning, Prompt Engineering, Code Generation, Validation, and Adaptive Feedback) that decomposes tasks, provides structured guidance, and refines outputs iteratively to compensate for SLM limitations. Empirical results show meaningful gains: certain SLMs with agentic scaffolding reach near-LLM performance on code generation and comprehension tasks, with substantial efficiency advantages and lower energy footprints. The findings support a strategy-over-scale approach to AI-assisted hardware design and propose an open-source pathway to broaden access and optimization for sustainable design workflows.

Abstract

Large Language Model (LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: is bigger always better for hardware design? Our work tests this by evaluating Small Language Models (SLMs) coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems (CVDP) benchmark. Results show that agentic workflows, through task decomposition, iterative feedback, and correction, not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.

Paper Structure

This paper contains 21 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Chip-design workflow for a commercial-grade SoC in a fabless semiconductor organization. YoE indicates years of experience. Task-specific SLMs integrated into a well-architected agentic-AI framework are well suited to automating beginner-level tasks that have explicit objectives, workflows, and evaluation metrics.
  • Figure 2: Proposed SLM-aware agentic AI framework. (1) PPA retrieves and structures context from the dataset; (2) SPEA constructs SLM-aware prompts using keyword injection, in-context examples, and token budgeting; (3) CA generates candidate RTL implementations; (4) VA performs syntax checking, I/O port usage analysis, and functional testing; and (5) AFA categorizes errors, evaluates quality, and produces structured refinement prompts. These agents form a closed-loop iterative workflow.
  • Figure 3: GPT-o4 mini vs. Deepseek-r1 (7B) responses to the same prompt (Problem ID: cvdp_copilot_16qam_mapper_0004).
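
The closed-loop workflow described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name, the `Report` structure, the character-based prompt budget, and the toy validation checks are assumptions made for illustration, and the `generate` callable stands in for whatever SLM is used. A real validator would invoke a Verilog linter and simulator instead of the string checks shown here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Report:
    """Outcome of one validation pass (hypothetical structure)."""
    passed: bool
    errors: List[str] = field(default_factory=list)


def plan(task: str) -> str:
    # Planning stand-in: structure the context for the task.
    return f"Task: {task.strip()}"


def build_prompt(context: str, feedback: List[str], max_chars: int = 2000) -> str:
    # Prompt-engineering stand-in: append refinement hints and enforce a
    # crude size budget (characters here; a real system would count tokens).
    hints = "".join(f"\nFix: {msg}" for msg in feedback)
    return (context + hints)[:max_chars]


def validate(rtl: str) -> Report:
    # Validation stand-in: toy structural checks only.
    errors = []
    if not re.search(r"\bmodule\b", rtl):
        errors.append("missing module declaration")
    if not re.search(r"\bendmodule\b", rtl):
        errors.append("missing endmodule")
    return Report(passed=not errors, errors=errors)


def run_pipeline(
    task: str,
    generate: Callable[[str], str],
    max_iters: int = 3,
) -> Tuple[str, int, bool]:
    """Iterate generate -> validate -> feedback until success or budget end.

    Returns (last candidate RTL, attempts used, whether validation passed).
    """
    context = plan(task)
    feedback: List[str] = []
    rtl = ""
    for attempt in range(1, max_iters + 1):
        prompt = build_prompt(context, feedback)
        rtl = generate(prompt)          # code-generation step (SLM call)
        report = validate(rtl)
        if report.passed:
            return rtl, attempt, True
        feedback = report.errors        # adaptive-feedback stand-in
    return rtl, max_iters, False


def fake_slm(prompt: str) -> str:
    # Hypothetical stub model: only emits a complete module once the
    # prompt carries refinement feedback.
    if "Fix:" in prompt:
        return "module and2(input a, input b, output y); assign y = a & b; endmodule"
    return "module and2(input a, input b, output y); assign y = a & b;"
```

Calling `run_pipeline("2-input AND gate", fake_slm)` exercises the loop with the stub model: the first candidate fails validation, the error is fed back into the next prompt, and the second candidate passes. Keeping the model behind a plain callable is a design choice of this sketch that lets different SLMs be swapped in without touching the loop.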