LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation
Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao
TL;DR
The paper addresses the challenge of learning high-coverage hardware verification policies when simulator feedback is expensive and non-differentiable. It proposes LLM4Cov, an offline execution-grounded framework that models verification as memoryless state transitions $s_t = (\mathcal{R}, x_t, o_t)$ scored by a scalar $\mathrm{Cov}(s_t) \in [0,1]$, and builds offline supervision through execution-validated data curation, coverage-guided agentic rejection fine-tuning, and verification-conditioned progressive learning. On the reality-aligned CVDP-ECov benchmark, a compact 4B model achieves a $69.2\%$ coverage pass rate and $90.4\%$ average coverage, outperforming a 30B teacher and approaching the results of much larger models. The results demonstrate that specialized agentic supervision under execution constraints can rival gains from scaling model size, and the approach remains compatible with RL or online fine-tuning when simulator budgets permit.
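To make the state formulation concrete, here is a minimal Python sketch of a memoryless state and a coverage-guided rejection filter. It is an illustration under stated assumptions, not the paper's implementation: the readings of $\mathcal{R}$ as the fixed design, $x_t$ as the current testbench, and $o_t$ as the simulator feedback are guesses, and the names `State`, `keep_improving`, and `min_gain` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    """One memoryless state s_t = (R, x_t, o_t); field meanings are assumptions."""
    design: str     # assumed reading of R: the fixed design under verification
    testbench: str  # assumed reading of x_t: the current testbench
    feedback: str   # assumed reading of o_t: simulator output for x_t


Transition = Tuple[State, State]


def keep_improving(
    transitions: List[Transition],
    cov: Callable[[State], float],
    min_gain: float = 0.0,
) -> List[Transition]:
    """Keep only transitions whose coverage gain exceeds min_gain.

    `cov` stands in for a deterministic evaluator returning a value in [0, 1].
    """
    kept = []
    for before, after in transitions:
        if before.design != after.design:
            raise ValueError("a transition must stay on the same design")
        if cov(after) - cov(before) > min_gain:
            kept.append((before, after))
    return kept
```

Because each state carries everything the evaluator needs, a transition can be scored without replaying the history that produced it. In this sketch, that is what lets the filtering run offline.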
Abstract
Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.
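One possible reading of worst-state-prioritized sampling is sketched below in Python. The weighting rule, which makes sampling probability proportional to the coverage gap $1 - \mathrm{Cov}(s)$, is an assumption made for illustration, as are the function names. The abstract does not specify the actual scheme.

```python
import random
from typing import List, Sequence


def worst_state_weights(coverages: Sequence[float]) -> List[float]:
    """Return sampling weights proportional to the coverage gap 1 - Cov(s).

    Assumed rule: lower-coverage states get more weight. Falls back to
    uniform weights when every state is already fully covered.
    """
    if not coverages:
        return []
    gaps = [1.0 - c for c in coverages]
    total = sum(gaps)
    if total <= 0.0:
        return [1.0 / len(coverages)] * len(coverages)
    return [g / total for g in gaps]


def sample_states(coverages: Sequence[float], k: int, seed: int = 0) -> List[int]:
    """Draw k state indices with replacement, favoring low-coverage states."""
    rng = random.Random(seed)  # seeded for reproducible draws
    weights = worst_state_weights(coverages)
    return rng.choices(range(len(coverages)), weights=weights, k=k)
```

Under this rule a state with coverage 1.0 receives zero weight, so the sampling budget goes to states where the current policy still falls short.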
