PerfDojo: Automated ML Library Generation for Heterogeneous Architectures

Andrei Ivanov, Siyuan Shen, Gioele Gottardo, Marcin Chrapek, Afif Boudaoud, Timo Schneider, Luca Benini, Torsten Hoefler

TL;DR

This work tackles performance portability across heterogeneous architectures by introducing PerfDojo, a transformation-centric IR and RL-enabled optimization environment, and PerfLLM, which uses reinforcement learning guided by LLM-encoded representations to discover high-performance kernel transformations without hardware priors. The core idea is to guarantee semantic preservation of each transformation, enabling automated exploration of a large optimization space. Key contributions include a semantic, human-readable IR with atomic, non-destructive transformations; a formal MDP and Max Q-Learning formulation for transformation sequencing; and empirical results showing meaningful speedups on x86, Arm, RISC-V, and state-of-the-art GPUs, including MI300A and GH200. The approach demonstrates that hardware-aware performance gains can be achieved through transformation-centric search and learning, reducing reliance on vendor-specific heuristics while maintaining correctness and portability.

Abstract

The increasing complexity of machine learning models and the proliferation of diverse hardware architectures (CPUs, GPUs, accelerators) make achieving optimal performance a significant challenge. Heterogeneity in instruction sets, specialized kernel requirements for different data types and model features (e.g., sparsity, quantization), and architecture-specific optimizations complicate performance tuning. Manual optimization is resource-intensive, while existing automatic approaches often rely on complex hardware-specific heuristics and uninterpretable intermediate representations, hindering performance portability. We introduce PerfLLM, a novel automatic optimization methodology leveraging Large Language Models (LLMs) and Reinforcement Learning (RL). Central to this is PerfDojo, an environment framing optimization as an RL game using a human-readable, mathematically inspired code representation whose transformations guarantee semantic validity. This allows effective optimization without prior hardware knowledge, facilitating both human analysis and RL agent training. We demonstrate PerfLLM's ability to achieve significant performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.

Paper Structure

This paper contains 25 sections, 6 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Manual vs. PerfDojo's transformation-centric optimization workflows.
  • Figure 2: Softmax kernel representations.
  • Figure 3: Optimization of a softmax kernel through a sequence of transformations (moves) on a CPU with AVX-512 extensions. Each move in the PerfDojo game maintains the initial program semantics.
  • Figure 4: Program transformation example: Buffer dimension reuse (reuse_dims) is applied correctly when preceded by loop fusion (join_scopes), as shown at the top, but yields incorrect computation without it, as shown at the bottom.
  • Figure 5: An example comparing the Q-value updates in original Q-Learning and Max Q-Learning. The best achievable state $S_3$, highlighted in green, demonstrates how Max Q-Learning explicitly prioritizes trajectories leading to higher peak rewards, thereby selecting action $a_1$, whereas original Q-Learning selects the immediate "stop" action $a_0$.
  • ...and 8 more figures
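
The contrast described in the Figure 5 caption can be illustrated with a small sketch. It assumes one common max-reward backup, Q(s, a) = max(r, γ · max Q(s', a')), in place of the usual sum r + γ · max Q(s', a'); the paper's exact update rule is not given in this summary and may differ. The chain of states, the rewards, the discount factor, and all function names below are hypothetical values chosen only to reproduce the qualitative behavior in the caption.

```python
# Toy comparison of a standard Q backup and a max-reward Q backup.
# ASSUMPTION: states, rewards, GAMMA and the backup rule are illustrative only;
# they are not taken from the paper.

GAMMA = 0.9

# TRANSITIONS[state][action] = (reward, next_state); next_state None ends the episode.
TRANSITIONS = {
    "S0": {"a0_stop": (1.0, None), "a1": (-0.5, "S1")},
    "S1": {"go": (-0.5, "S2")},
    "S2": {"go": (2.0, None)},  # reaching the best state pays the peak reward
}


def standard_backup(reward, discounted_next):
    """Usual Q-Learning target: immediate reward plus discounted future value."""
    return reward + discounted_next


def max_backup(reward, discounted_next):
    """Max-reward target: the larger of the immediate and the discounted future value."""
    return max(reward, discounted_next)


def q_values(backup):
    """Compute Q for the deterministic chain by repeated sweeps over all state-action pairs."""
    q = {s: {a: 0.0 for a in acts} for s, acts in TRANSITIONS.items()}
    for _ in range(len(TRANSITIONS) + 1):  # enough sweeps for this short chain to converge
        for state, acts in TRANSITIONS.items():
            for action, (reward, nxt) in acts.items():
                if nxt is None:
                    q[state][action] = reward
                else:
                    q[state][action] = backup(reward, GAMMA * max(q[nxt].values()))
    return q


def greedy(q, state):
    """Return the action with the highest Q-value in the given state."""
    return max(q[state], key=q[state].get)


standard_q = q_values(standard_backup)
max_q = q_values(max_backup)

print("standard Q-Learning picks:", greedy(standard_q, "S0"))
print("max Q-Learning picks:", greedy(max_q, "S0"))
```

In this toy chain the intermediate steps carry small negative rewards, so the summed return of the longer path falls below the immediate "stop" reward, while the max-reward backup still propagates the peak reward back to the first decision. This mirrors a setting in which only the best program found along a transformation sequence matters, not the sum of intermediate measurements.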