KForge: Program Synthesis for Diverse AI Hardware Accelerators
Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar
TL;DR
KForge is a dual-agent, platform-agnostic framework for automatically synthesizing high-performance kernels, evaluated on NVIDIA CUDA and Apple Metal. It pairs a Program Synthesis Agent with a Performance Analysis Agent: the first iteratively refines candidate kernels until they are functionally correct, and the second turns profiling data into feedback aimed at better hardware utilization. A single-shot example is enough to initialize generation for a new platform. Across the CUDA and Apple MPS (Metal) backends, KForge reports substantial gains in correctness and, in many cases, speedups relative to baselines. The results also show that profiling data plays a nuanced role and that reference implementations from another platform improve generation quality. The work contributes a modular, reusable workflow for cross-accelerator kernel generation, identifies practical limitations such as noisy profiling signals and the risk of converging to local optima, and proposes concrete directions for richer feedback and formal verification. Overall, KForge advances automated, cross-platform kernel synthesis and shows that it applies to diverse parallel architectures and workload classes.
Abstract
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
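The collaboration between the two agents can be pictured as a feedback loop. The following is a minimal sketch of that idea, not KForge's actual implementation: the names `Candidate`, `refine`, `toy_generate`, `toy_evaluate`, and `toy_analyze` are hypothetical, the feedback strings are invented for illustration, and the LLM calls, compiler, and profiler are replaced by deterministic stubs so the control flow can be run as is.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    """One generated program and what evaluation learned about it."""
    source: str
    compiled: bool = False
    correct: bool = False
    runtime_ms: Optional[float] = None


def refine(
    generate: Callable[[str, str], str],   # (task, feedback) -> program source
    evaluate: Callable[[str], Candidate],  # compile, check correctness, time
    analyze: Callable[[Candidate], str],   # profiling data -> recommendation
    task: str,
    max_iters: int = 5,
) -> Optional[Candidate]:
    """Return the fastest correct candidate found, or None if there is none."""
    feedback = ""
    best: Optional[Candidate] = None
    for _ in range(max_iters):
        cand = evaluate(generate(task, feedback))
        if not cand.compiled:
            # Functional pass: report the compilation failure.
            feedback = "fix compilation error"
        elif not cand.correct:
            # Functional pass: report the correctness failure.
            feedback = "fix incorrect output"
        else:
            # Optimization pass: keep the best program, ask for advice.
            if best is None or cand.runtime_ms < best.runtime_ms:
                best = cand
            feedback = analyze(cand)
    return best


# --- Deterministic stand-ins for the two agents (assumptions, not real APIs) ---

def toy_generate(task: str, feedback: str) -> str:
    # A real generation agent would prompt an LLM with the task and feedback.
    return f"{task}::{feedback}"


def toy_evaluate(source: str) -> Candidate:
    # A real evaluator would compile and run the kernel on the target device.
    feedback = source.split("::", 1)[1]
    if feedback == "":
        return Candidate(source)
    if feedback == "fix compilation error":
        return Candidate(source, compiled=True)
    if feedback == "fix incorrect output":
        return Candidate(source, compiled=True, correct=True, runtime_ms=10.0)
    return Candidate(source, compiled=True, correct=True, runtime_ms=4.0)


def toy_analyze(cand: Candidate) -> str:
    # A real analysis agent would interpret profiler output for this candidate.
    return "use shared-memory tiling"


best = refine(toy_generate, toy_evaluate, toy_analyze, "matmul", max_iters=4)
print(best)
```

In this sketch the functional pass and the optimization pass share one loop and differ only in which feedback is sent back to the generator. Whether the real system interleaves them this way is an assumption made here for brevity.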
