Opal: A Modular Framework for Optimizing Performance using Analytics and LLMs

Mohammad Zaeed, Tanzima Z. Islam, Vladimir Inđić

TL;DR

This work tackles the challenge of turning runtime performance diagnostics into actionable GPU code optimizations. It introduces Opal, a modular framework that fuses Roofline, PC sampling, and hardware-counter analysis into token-efficient prompts for an LLM, augmented by belief tracing to reveal the model's reasoning. Across 1640 experiments on NVIDIA and AMD GPUs, Opal delivers substantial speedups (up to 87.6%) with high transformation correctness and provides explainable, auditable edits linked to concrete diagnostics. The approach democratizes expert-level performance engineering by automating the diagnosis-to-optimization loop and is readily extensible to future architectures and accelerator platforms.

Abstract

Large Language Models (LLMs) show promise for automated code optimization but struggle without performance context. This work introduces Opal, a modular framework that connects performance analytics insights with the vast body of published work by guiding LLMs to generate informed, trustworthy optimizations. Unlike traditional performance tools, which identify bottlenecks but stop short of actionable suggestions, Opal bridges this long-standing gap by linking dynamic insights from hardware counters, Roofline analysis, and stall events to optimization decisions. We evaluate Opal across 1640 experiments on real-world GPU kernels and find that in over 98.5% of cases even a single insight source yields a speedup, with average speedups ranging from 19.34% to 52.3%. Our prompt template produced correct code in all but one case, where a vague diagnostic led to an unsafe suggestion. By automatically optimizing GPU kernels using performance analytics and LLMs, Opal marks a leap toward democratizing expert-level performance engineering for all.
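To make the idea of folding several diagnostics into one compact prompt concrete, here is a minimal Python sketch. It is not Opal's actual prompt template, which this summary does not show. The `KernelDiagnostics` fields, the prompt wording, the kernel name `accuracy_kernel`, the counter name, and the hardware peaks are all hypothetical. Only the `stall_wait` count at line 6 is taken from the Figure 4 caption, and the memory-bound versus compute-bound test is the standard Roofline comparison against the ridge point.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class KernelDiagnostics:
    """Hypothetical container for the three insight sources named in the abstract."""
    kernel_name: str
    arithmetic_intensity: float  # FLOP per byte moved
    peak_flops: float            # device peak, FLOP/s (assumed value)
    peak_bandwidth: float        # device peak, bytes/s (assumed value)
    # source line -> (dominant stall reason, sample count)
    stalls_by_line: Dict[int, Tuple[str, int]] = field(default_factory=dict)
    counters: Dict[str, float] = field(default_factory=dict)


def roofline_bound(d: KernelDiagnostics) -> str:
    """Standard Roofline test: below the ridge point means memory-bound."""
    ridge = d.peak_flops / d.peak_bandwidth  # FLOP per byte
    return "memory-bound" if d.arithmetic_intensity < ridge else "compute-bound"


def build_prompt(d: KernelDiagnostics, source: str, top_k: int = 3) -> str:
    """Keep only the top_k hottest lines so the prompt stays short."""
    hottest = sorted(d.stalls_by_line.items(),
                     key=lambda kv: kv[1][1], reverse=True)[:top_k]
    parts = [
        f"Kernel: {d.kernel_name}",
        f"Roofline: {roofline_bound(d)} (AI={d.arithmetic_intensity:.2f} FLOP/B)",
        "Top stalls (line: reason x samples):",
    ]
    parts += [f"  {line}: {reason} x {count}" for line, (reason, count) in hottest]
    parts.append("Counters: " + ", ".join(
        f"{name}={value:g}" for name, value in sorted(d.counters.items())))
    parts.append("Task: propose an optimized kernel and cite the diagnostic "
                 "that motivates each edit.")
    parts.append("Source:")
    parts.append(source)
    return "\n".join(parts)


# Illustrative values only; the stall count is the one quoted for Figure 4.
diag = KernelDiagnostics(
    kernel_name="accuracy_kernel",
    arithmetic_intensity=0.5,
    peak_flops=10e12,
    peak_bandwidth=1e12,
    stalls_by_line={6: ("stall_wait", 45387)},
    counters={"hypothetical_counter": 0.42},
)
prompt = build_prompt(diag, "__global__ void accuracy_kernel(/* ... */) { /* ... */ }")
print(prompt)
```

Truncating to the hottest lines is one plausible way to keep a prompt token-efficient; the framework may select and format diagnostics quite differently.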

Paper Structure

This paper contains 33 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the Opal framework. Users select performance profiles, and Opal constructs structured prompts compatible with any LLM, including custom models.
  • Figure 2: Ablation study of the impact of diagnostic sources on optimization performance. Error bars represent performance variability across repeated runs. IA alone achieves significant single-source gains, notably 64.93% on Shmembench. PC+IA consistently delivers stable optimizations across kernels. Sobol shows substantial variability, indicating that the type of information can impact the effectiveness of code transformation.
  • Figure 3: Performance improvements across input configurations. Blue rectangles represent improvements for the default configuration (Table \ref{tab:app-config-summary}), green circles indicate the maximum improvement, and red diamonds indicate the minimum improvement. Optimizations generated for one configuration also yield performance improvements across other configurations, although the margin varies across applications. Accuracy achieves the most stable gains (43.1%–45.5%), while the gains for Sobol vary widely (4.6%–87.6%).
  • Figure 4: Comparison of stall occurrences between unoptimized (left) and optimized (right) Accuracy kernels. The Y-axis lists line numbers and shortened code snippets. Each bar is labeled by the dominant stall type. At line 6, stall_wait occurrences decrease from $45,387 \to 22,642$; similar significant reductions occur at lines $29 \to 31$, validating the effectiveness of targeted optimizations.
  • Figure 5: Ablation study on source contributions to optimization performance for HIP applications. We observe large improvements for Babelstream kernels (41.5% to 99.27%) when IA and Roofline analysis are used together as performance insights. For all other kernels, there is no real improvement; the small gains shown are attributable to run-to-run variability in GPU execution.
  • ...and 1 more figure
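
The `stall_wait` counts quoted in the Figure 4 caption can be turned into a relative reduction with a short calculation. The helper name `relative_reduction` is introduced here for illustration only; the two counts are the ones reported for line 6.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# stall_wait samples at line 6, unoptimized vs. optimized (Figure 4 caption)
print(f"{relative_reduction(45387, 22642):.1f}%")  # prints 50.1%
```

The count at that line is therefore roughly halved. This is a statement about stall samples, not about end-to-end runtime.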