
Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, Akash Srivastava

Abstract

We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage~1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage~2, it launches $N$ expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus~4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean $8.27\times$ speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds $20\times$ and kmeans reaches approximately $10\times$. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
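
The Stage-1 assembly step described above admits a compact ILP: a binary variable selects exactly one synthesized variant per sub-kernel, the objective sums the selected latencies, and a single constraint enforces the area budget. Below is a minimal sketch of that formulation; the PuLP solver, variable names, and toy latency/area numbers are our own assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the Stage-1 selection ILP (assumptions: PuLP solver,
# toy latency/area numbers). Each sub-kernel has candidate variants with
# (latency, area) pairs measured by synthesis; pick exactly one variant
# per sub-kernel to minimize total latency under a global area budget.
import pulp

# Hypothetical synthesis results: sub-kernel -> list of (latency, area).
variants = {
    "f1": [(100, 5), (60, 12), (40, 25)],
    "f2": [(80, 4), (30, 18)],
}
area_budget = 30

prob = pulp.LpProblem("stage1_selection", pulp.LpMinimize)
x = {
    (k, i): pulp.LpVariable(f"x_{k}_{i}", cat="Binary")
    for k, vs in variants.items()
    for i in range(len(vs))
}

# Objective: total latency of the selected variants.
prob += pulp.lpSum(variants[k][i][0] * x[k, i] for (k, i) in x)

# Exactly one variant per sub-kernel.
for k, vs in variants.items():
    prob += pulp.lpSum(x[k, i] for i in range(len(vs))) == 1

# Global area budget.
prob += pulp.lpSum(variants[k][i][1] * x[k, i] for (k, i) in x) <= area_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = {k: i for (k, i) in x if x[k, i].value() == 1}
print(chosen)  # {'f1': 1, 'f2': 1} -> latency 90, area 30
```

Because the exactly-one constraints force every feasible point to be a complete configuration, the solver's optimum is directly a candidate design assembled from per-sub-kernel synthesis results.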

Paper Structure

This paper contains 28 sections, 3 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Two-stage agent-based pipeline for HLS design space exploration. Given an input design $\mathcal{D}$, a coordinator agent extracts the function call graph $G$ and spawns one optimizer agent per sub-function $f_1, \dots, f_K$. Variants are evaluated for correctness and synthesized to obtain (latency, area) pairs. An ILP solver then selects the top-$N$ combinations $\mathcal{S} = \{s_1, \dots, s_N\}$ that minimize total latency subject to the area budget. In Stage 2, $N$ exploration agents each start from a candidate solution and iteratively apply design-wide optimization passes to produce the final optimized design $\mathcal{D}^*$ (see the enumeration sketch after this list).
  • Figure 2: Pareto front results for all twelve benchmarks under agent scaling ($N \in \{1,2,4,8,10\}$). Each subplot shows speedup over baseline (y-axis) versus area (x-axis). Increasing the number of agents extends the Pareto front toward lower latency and more favorable area--latency trade-offs across most benchmarks, with the strongest gains on harder problems such as streamcluster, leukocyte, and NW.
  • Figure 3: Latency improvement factor over baseline versus number of expert agents across six benchmarks. Improvements range from 1.4$\times$ to 14.5$\times$ and generally increase with agent count, typically plateauing after four agents.
  • Figure 4: Average inference cost under agent scaling. Each session uses a combination of Opus 4.5/4.6.
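
Figure 1's top-$N$ selection can be recovered from the same ILP by re-solving with exclusion ("no-good") cuts that forbid each incumbent combination. The sketch below continues the toy model above; the cut construction is our illustration and may differ from the paper's exact enumeration method.

```python
# Sketch: enumerate the top-N ILP solutions by iteratively excluding each
# incumbent with a "no-good" cut. Reuses `prob`, `x`, and `pulp` from the
# Stage-1 sketch above (assumption: this mirrors, not reproduces, the
# paper's top-N selection).
def top_n_solutions(prob, x, n):
    solutions = []
    for _ in range(n):
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        if pulp.LpStatus[prob.status] != "Optimal":
            break  # fewer than n feasible configurations exist
        picked = [key for key in x if x[key].value() == 1]
        solutions.append(picked)
        # Forbid this exact combination in subsequent solves.
        prob += pulp.lpSum(x[key] for key in picked) <= len(picked) - 1
    return solutions

candidates = top_n_solutions(prob, x, n=3)
```

Each cut removes exactly one complete assignment, so successive solves yield the $N$ best configurations in nondecreasing latency order; these become the starting points for the Stage-2 exploration agents.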