Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren

TL;DR

This work confronts static reasoning about GPU kernel performance by introducing gpuFLOPBench, a dataset of 577 real CUDA kernels with ground-truth SP/DP FLOP counts and eight execution-attribute annotations. It evaluates contemporary closed-source LLMs on two tasks: classifying FLOP workloads and predicting exact FLOP counts without execution, revealing that models excel on straightforward kernels but fail to account for implicit FLOPs arising from division, intrinsics, and common subexpressions due to hardware microcode effects. The results demonstrate gradual improvements in reasoning models over time yet identify a core limitation: current LLMs lack an internal hardware-aware cost model, especially for Mixed workloads, motivating hybrid approaches that combine LLM reasoning with profiling or ISA-level templates. gpuFLOPBench thus provides a focused benchmark to drive development of performance-aware coding assistants that reason about cost with the same rigor as expert GPU developers.

Abstract

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today's Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to "count without running" by predicting single- and double-precision FLOP counts for 577 CUDA kernels drawn from HeCBench, annotated with ground-truth profiles and eight execution attributes that distinguish trivially analyzable code from kernels whose FLOPs depend on hidden compiler or runtime behavior. Evaluating current closed-source reasoning models shows clear but uneven progress: the newest LLMs achieve perfect classification on straightforward kernels but still incur errors of multiple orders of magnitude whenever implicit FLOPs arise from division, intrinsic math functions, or common subexpressions. These results surface a core limitation of existing code assistants -- the inability to internalize hardware-specific microcode effects -- and position gpuFLOPBench as a focused testbed for developing LLM tooling that can reason about performance with the same rigor as experienced GPU developers. Sources are available at our repository: https://github.com/Scientific-Computing-Lab/gpuFLOPBench

Paper Structure

This paper contains 16 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the dataset creation steps for gpuFLOPBench. We scrape CUDA kernels from the HeCBench suite and profile the first invocation of each kernel, gathering: kernel names, source code, grid and block sizes, kernel launch parameters, command-line execution arguments, and profiled single/double-precision FLOP counts. Each kernel's source code is manually checked for execution attributes that could introduce implicit/indirect FLOPs; a kernel with any such attribute is categorized as a hard code that cannot easily be statically analyzed directly. Example values from the lulesh-cuda program are shown for illustration.
  • Figure 2: Distribution of FLOP workload type in easy (inner ring) and hard (outer ring) subsets of gpuFLOPBench. The easy subset lacks any mixed kernels that do both SP and DP FLOPs. The hard subset is larger and has class imbalance due to the difficulties of balancing.
  • Figure 3: Counts of kernels with given combinations of execution attributes for 717 manually classified CUDA kernels from the HeCBench suite. Attribute combination indicators are represented as binary strings in the same order as the legend. Green bars indicate attribute combinations that can be directly statically analyzed, while red bars are combinations that cannot be directly statically analyzed and can thus engender indirect FLOPs.
  • Figure 4: Mean Absolute Log Error (MALE) results by subset on (a) FLOP workload categorization and (b) kernel code attributes. For MALE, lower is better.
  • Figure 5: LangGraph tool call structure passed to an LLM upon invocation. The field descriptions and datatypes help the LLM to accurately complete the tool call.
  • ...and 1 more figure
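
Figure 4 reports results as Mean Absolute Log Error (MALE). The paper's exact definition is not reproduced in this excerpt, so the following is only a minimal sketch of one plausible form: the mean of absolute differences between base-10 logarithms of predicted and profiled FLOP counts. The log base, the `eps` offset used to tolerate zero counts, and the function name `mean_absolute_log_error` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def mean_absolute_log_error(predicted, actual, eps=1.0):
    """Mean of |log10(p + eps) - log10(a + eps)| over paired FLOP counts.

    ASSUMPTIONS (not taken from the paper): base-10 logarithm, and an
    additive offset `eps` so that kernels with zero FLOPs do not produce
    log(0). Under this form, a value of 1.0 means predictions are off by
    one order of magnitude on average.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")

    total = 0.0
    for p, a in zip(predicted, actual):
        if p < 0 or a < 0:
            raise ValueError("FLOP counts must be non-negative")
        total += abs(math.log10(p + eps) - math.log10(a + eps))
    return total / len(predicted)


# Hypothetical counts: the first prediction overshoots by two orders of
# magnitude (after the eps offset), the second is exact.
example_error = mean_absolute_log_error([999, 9], [9, 9])
```

Working in log space treats a prediction that is 10x too high the same as one that is 10x too low, which suits FLOP counts that span many orders of magnitude across kernels.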