VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation

Luca Collini, Andrew Hennesee, Patrick Yubeaton, Siddharth Garg, Ramesh Karri

TL;DR

This work provides an empirical map of recent trends in LMs for Verilog code generation, focusing on interactions among model reasoning, specialization, and prompt engineering strategies, and identifies patterns in how model classes respond to structured prompts and optimization.

Abstract

Rapid advances in language models (LMs) have created new opportunities for automated code generation while complicating trade-offs between model characteristics and prompt design choices. In this work, we provide an empirical map of recent trends in LMs for Verilog code generation, focusing on interactions among model reasoning, specialization, and prompt engineering strategies. We evaluate a diverse set of small and large LMs, including general-purpose, reasoning, and domain-specific variants. Our experiments use a controlled factorial design spanning benchmark prompts, structured outputs, prompt rewriting, chain-of-thought reasoning, in-context learning, and evolutionary prompt optimization via Genetic-Pareto. Across two Verilog benchmarks, we identify patterns in how model classes respond to structured prompts and optimization, and we document which trends generalize across LMs and benchmarks versus those that are specific to particular model-prompt combinations.

Paper Structure (24 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: RQ1: Scaling and specialization for Verilog generation. Pass@1, normalized between each model’s baseline prompt performance and its best-performing prompting configuration, is plotted against parameter count for two benchmarks: VeriThoughts (left) and VerilogEval v2 (right). Curves show the Qwen family (Qw), Qwen-Coder (QwC), and Verilog-adapted models (VR and VT); the star denotes DeepSeek-CoderV2-16B. Horizontal reference lines correspond to selected commercial models.
  • Figure 2: RQ2: Sensitivity to prompting strategies across LMs. Each cell reports the change in Pass@10 (percentage points) relative to the baseline prompt (Base), for four prompting variants (Struct, Struct CoT, Refine, Refine CoT) evaluated without ICL. Results are shown for VeriThoughts (left) and VerilogEval v2 (right); blue indicates gains and red indicates degradations.
  • Figure 3: RQ2: Impact of structured output prompting. For each LM, we report P@1/P@5/P@10 under the baseline prompt (Baseline), the structured-output variant (Struct), and its chain-of-thought extension (Struct CoT). Results are shown for VeriThoughts (top) and VerilogEval v2 (bottom).
  • Figure 4: RQ2: Impact of prompt refinement. For each LM, we report P@1/P@5/P@10 under the baseline prompt (Baseline), the two-stage refinement pipeline (Refine), and its chain-of-thought variant (Refine CoT). Results are shown for VeriThoughts (top) and VerilogEval v2 (bottom).
  • Figure 5: RQ2: Impact of ICL. For each model and prompt family (Baseline, Struct, and Refine), we compare zero-shot prompting against few-shot ICL, reporting P@1/P@5/P@10 on VeriThoughts (top) and VerilogEval v2 (bottom).
  • ...and 4 more figures
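
The captions above report P@1/P@5/P@10 (pass@k). As a minimal sketch, assuming the widely used unbiased pass@k estimator (this page does not say which estimator the authors use), the metric can be computed from n sampled completions per problem, of which c pass the testbench. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: completions sampled for the problem
    c: completions that passed functional verification
    k: evaluation budget (for example 1, 5, or 10)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A per-cell change such as the one described for Figure 2 would then be the difference between two such benchmark averages (variant minus baseline), multiplied by 100 to express it in percentage points.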