KForge: Program Synthesis for Diverse AI Hardware Accelerators

Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar

TL;DR

KForge introduces a dual-agent, platform-agnostic framework for autonomous program synthesis of high-performance kernels across NVIDIA CUDA and Apple Metal. It couples a Program Synthesis Agent with a Performance Analysis Agent to iteratively refine functional correctness and optimize hardware utilization, leveraging single-shot initialization and profiling-driven feedback. Across CUDA and MPS backends, KForge demonstrates substantial gains in correctness and, in many cases, speedups relative to baselines, while revealing the nuanced role of profiling data and the benefits of cross-platform reference implementations. The work contributes a modular, reusable workflow for cross-accelerator kernel generation, highlights practical limitations such as noise in profiling signals and potential for local optima, and proposes concrete directions for richer feedback and formal verification. Overall, KForge advances automated, cross-platform kernel synthesis with demonstrated applicability to diverse parallel architectures and workload classes.

Abstract

GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.

Paper Structure

This paper contains 33 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Iterative program synthesis and optimization loop using LLMs. The workflow consists of two main phases: (1) a functional pass that iteratively refines synthesized programs until the code compiles, executes without errors, and produces correct output, and (2) an optimization pass that provides performance feedback to the LLM for iterative performance improvement.
  • Figure 2: Program synthesis prompt template
  • Figure 3: CUDA program synthesis. Iterative refinement against PyTorch eager mode
  • Figure 4: CUDA program synthesis. Iterative refinement vs. iterative refinement + profiling information against torch.compile
  • Figure 5: MPS program synthesis. Iterative refinement vs. iterative refinement + CUDA reference implementation
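
The two-phase loop described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KForge's actual implementation: the names `Evaluation`, `synthesize`, `generate`, `evaluate`, and `analyze` are hypothetical, the iteration budgets are arbitrary, and the rule of keeping only candidates that are both correct and faster is an assumption that the source does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Evaluation:
    """Outcome of compiling and running one candidate program (hypothetical type)."""
    compiled: bool
    correct: bool
    runtime_ms: float = float("inf")  # only meaningful when the program is correct
    feedback: str = ""                # compiler or correctness error text


def synthesize(
    generate: Callable[[Optional[str], str], str],  # (previous program, feedback) -> new program
    evaluate: Callable[[str], Evaluation],          # compile, run, and check correctness
    analyze: Callable[[str, Evaluation], str],      # profiling data -> recommendation
    functional_iters: int = 5,
    optimization_iters: int = 5,
) -> Optional[Tuple[str, Evaluation]]:
    program: Optional[str] = None
    feedback = ""

    # Functional pass: refine until the program compiles and produces correct output.
    for _ in range(functional_iters):
        program = generate(program, feedback)
        result = evaluate(program)
        if result.compiled and result.correct:
            break
        feedback = result.feedback
    else:
        return None  # budget exhausted without a working program

    best_program, best_result = program, result

    # Optimization pass: feed performance recommendations back to the generator.
    # Assumption: a candidate replaces the current best only if it is correct and faster.
    for _ in range(optimization_iters):
        recommendation = analyze(best_program, best_result)
        candidate = generate(best_program, recommendation)
        candidate_result = evaluate(candidate)
        if (
            candidate_result.compiled
            and candidate_result.correct
            and candidate_result.runtime_ms < best_result.runtime_ms
        ):
            best_program, best_result = candidate, candidate_result

    return best_program, best_result
```

In this sketch `generate` stands in for the generation agent and `analyze` for the performance analysis agent; both would wrap LLM calls in practice, while `evaluate` would wrap the platform's compiler and a correctness check against a reference output.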