Opal: A Modular Framework for Optimizing Performance using Analytics and LLMs
Mohammad Zaeed, Tanzima Z. Islam, Vladimir Inđić
TL;DR
This work tackles the challenge of turning runtime performance diagnostics into actionable GPU code optimizations. It introduces Opal, a modular framework that fuses Roofline, PC sampling, and hardware-counter analysis into token-efficient prompts for an LLM, augmented by belief tracing to reveal the model's reasoning. Across 1640 experiments on NVIDIA and AMD GPUs, Opal delivers substantial speedups (up to 87.6%) with high transformation correctness and provides explainable, auditable edits linked to concrete diagnostics. The approach democratizes expert-level performance engineering by automating the diagnosis-to-optimization loop and is readily extensible to future architectures and accelerator platforms.
Abstract
Large Language Models (LLMs) show promise for automated code optimization but struggle without performance context. This work introduces Opal, a modular framework that connects performance analytics insights with the vast body of published by guiding LLMs to generate informed, trustworthy optimizations. Unlike traditional performance tools that identify bottlenecks but stop short of actionable suggestions, Opal bridges this long-standing gap by linking dynamic insights from hardware counters and Roofline analysis to stall events to optimization decisions. We evaluate Opal across 1640 experiments on real-world GPU kernels and find that in over 98.5% of cases, even a single insight source yields speedups, ranging on average from 19.34% to 52.3%. Our prompt template produced correct code in all but one case, where a vague diagnostic caused an unsafe suggestion. By automatically optimizing GPU kernels using performance analytics and LLMs, Opal marks a leap toward democratizing expert-level performance engineering for all.
