LLMPerf: GPU Performance Modeling meets Large Language Models

Khoi N. M. Nguyen, Hoang Duy Nguyen Do, Huyen Thao Le, Thanh Tuan Dao

TL;DR

This work introduces LLMPerf, a framework that uses large language models to predict OpenCL kernel execution time from static kernel code and launch configurations, addressing the lack of scalable, architecture-agnostic performance models. It builds a large-scale OpenCL kernel performance dataset by pairing a corpus of mostly one-dimensional (1D) kernels with memory-analysis and empirical strategies that generate diverse launch configurations while mitigating data imbalance. LLMPerf fine-tunes CodeGen with a regression head to predict the logarithm of execution time, achieving a mean absolute percentage error (MAPE) of $24.25\%$ on a 400K-scale validation set and an average MAPE of $46.11\%$ on real benchmarks. The results demonstrate the potential of NLP models for performance modeling, and they also reveal limitations tied to kernel patterns and to correlations between input size and global size, which motivates further improvements to the dataset and the kernel representation.

Abstract

Performance modeling, a pivotal domain in program cost analysis, currently relies on manually crafted models constrained by various program and hardware limitations, especially in the intricate landscape of GPGPU. Meanwhile, Large Language Models (LLMs) have demonstrated their effectiveness in addressing diverse programming challenges. Our work establishes a connection between LLMs and performance modeling, employing the LLM as a performance estimator. Through experimental exploration with carefully designed large-scale OpenCL datasets, we highlight both the potential and the main difficulties of using LLMs for performance modeling of OpenCL device source programs. As the first study in this line of work, our LLM-based performance model achieves a mean absolute percentage error of $24.25\%$ on a large-scale generated validation set. On a set of publicly available OpenCL programs, our model achieves a mean absolute percentage error of $46.1\%$.
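To make the reported metric concrete, the following is a minimal sketch of how a mean absolute percentage error could be computed when a model outputs the logarithm of execution time, as the TL;DR describes. The function name, the use of the natural logarithm, and the sample numbers are assumptions for illustration only; they are not taken from the paper.

```python
import math


def mape_from_log_predictions(pred_log_times, target_times):
    """Return MAPE in percent for predictions given as log(execution time).

    Assumption: predictions use the natural logarithm and targets are
    positive execution times in the same unit (for example, milliseconds).
    """
    if len(pred_log_times) != len(target_times):
        raise ValueError("predictions and targets must have the same length")
    if not target_times:
        raise ValueError("at least one sample is required")

    total = 0.0
    for log_pred, target in zip(pred_log_times, target_times):
        pred = math.exp(log_pred)  # undo the log transform
        total += abs(pred - target) / target  # relative error of one sample
    return 100.0 * total / len(target_times)


# Hypothetical data: three kernels with made-up times.
targets = [1.0, 2.0, 4.0]
log_preds = [math.log(1.1), math.log(1.8), math.log(5.0)]
mape = mape_from_log_predictions(log_preds, targets)
print(f"MAPE: {mape:.2f}%")
```

Because the error is relative to each target, a kernel that runs for microseconds contributes as much to the metric as one that runs for seconds, which is one plausible reason to train on log time when execution times span several orders of magnitude.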

Paper Structure

This paper contains 17 sections, 3 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Example of execution time data before and after using IQR
  • Figure 2: Result of LLMPerf on real benchmark kernels by data (input) size. Red points are the predicted times and blue points are the corresponding target times.
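The caption of Figure 1 refers to filtering execution time data with the interquartile range (IQR). This page does not state the exact rule, so the sketch below assumes the common Tukey convention of dropping samples outside 1.5 times the IQR beyond the first and third quartiles. The function name, the factor `k`, and the sample measurements are hypothetical.

```python
import statistics


def iqr_filter(times, k=1.5):
    """Keep samples inside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: k = 1.5 (Tukey's fences). The paper's actual threshold
    is not given here.
    """
    if len(times) < 2:
        return list(times)
    # method="inclusive" interpolates linearly between sorted samples.
    q1, _, q3 = statistics.quantiles(times, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [t for t in times if lower <= t <= upper]


# Hypothetical repeated measurements of one kernel, with one outlier run.
measured = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 100.0]
filtered = iqr_filter(measured)
print(filtered)
```

With these made-up values the quartiles are 11.5 and 14.5, so the fences are 7.0 and 19.0 and only the 100.0 measurement is discarded.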