TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun

TL;DR

TritonBench introduces a dual-channel benchmark for evaluating LLMs tasked with generating Triton operators, combining real-world GitHub operators (TritonBench-G) with PyTorch-aligned tasks (TritonBench-T). It pairs functional testing with performance profiling on NVIDIA GPUs and uses multiple metrics, including a CodeBLEU-based similarity score and GPU efficiency, to assess both correctness and efficiency. Across a suite of baselines, results show current LLMs struggle to produce high-quality, performant Triton code, though performance improves with one-shot prompts and domain-specific fine-tuning. The framework aims to guide future research in DSL-aware, performance-aware automatic operator generation for Triton and similar GPU-DSL ecosystems.

Abstract

Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at https://github.com/thunlp/TritonBench.

Paper Structure

This paper contains 35 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of the construction and evaluation of TritonBench.
  • Figure 2: Implementation of the Triton "add" operator. Lines $3$-$6$ perform tensor element addressing, followed by the calculation and storage in lines $7$-$10$. The kernel is called in line $15$ of the wrapper.
  • Figure 3: Distribution of GPU efficiency of the Triton operators in TritonBench-G.
  • Figure 4: Execution results distribution across difficulty levels in TritonBench-G.
  • Figure 5: Execution results distribution across difficulty levels in TritonBench-T.
  • ...and 2 more figures