TileLang: A Composable Tiled Programming Model for AI Systems

Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang

TL;DR

TileLang introduces a Python-embedded, tile-based programming model that decouples dataflow from scheduling to simplify writing high-performance AI kernels. The system provides tile declarations, explicit memory placement, dataflow operators with layout-inference capabilities, and scheduling primitives, all supported by a compiler pipeline that lowers to hardware-optimized code. Empirical results across NVIDIA and AMD GPUs show TileLang achieving state-of-the-art or competitive performance on kernels like GEMM, MHA, and linear attention, while maintaining significantly shorter, more maintainable code compared to baselines. The work suggests strong potential for broader hardware support and future enhancements, including self-hosted tile libraries, distributed scenarios, cost modeling, and dynamic-shape optimization, with open-source availability.

Abstract

Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorize, and pipeline) from dataflow and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's data-flow itself, while leaving most other optimizations to compilers. We conduct comprehensive experiments on commonly used devices. Our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.
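The tile data-flow pattern described in the abstract (copy a tile into fast memory, compute on it, write the result back) can be illustrated with a minimal sketch in plain Python. This is not TileLang's API and not the authors' implementation: `copy_tile`, `tiled_matmul`, `reference_matmul`, and the `block_m`/`block_n`/`block_k` parameters are hypothetical names chosen for illustration. The sketch only shows that the data-flow stays the same while the tile sizes, which stand in for scheduling choices, vary independently.

```python
# Illustrative sketch only: plain Python, not TileLang's API.
# The data-flow (copy tile -> accumulate -> write back) is fixed;
# the tile sizes, standing in for scheduling choices, are passed in separately.


def copy_tile(src, row0, col0, rows, cols):
    """Copy a rows x cols tile out of src (stands in for a DRAM -> SRAM copy)."""
    return [[src[row0 + i][col0 + j] for j in range(cols)] for i in range(rows)]


def tiled_matmul(A, B, block_m, block_n, block_k):
    """Compute C = A @ B one output tile at a time."""
    M, K, N = len(A), len(B), len(B[0])
    # Assumption for brevity: every dimension divides evenly into tiles.
    assert M % block_m == 0 and N % block_n == 0 and K % block_k == 0
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, block_m):          # one "block" per output tile
        for n0 in range(0, N, block_n):
            # Local accumulator tile, analogous to register storage.
            acc = [[0] * block_n for _ in range(block_m)]
            for k0 in range(0, K, block_k):  # walk along the reduction axis
                a_tile = copy_tile(A, m0, k0, block_m, block_k)
                b_tile = copy_tile(B, k0, n0, block_k, block_n)
                for i in range(block_m):
                    for j in range(block_n):
                        for k in range(block_k):
                            acc[i][j] += a_tile[i][k] * b_tile[k][j]
            # Write the finished tile back to the output matrix.
            for i in range(block_m):
                for j in range(block_n):
                    C[m0 + i][n0 + j] = acc[i][j]
    return C


def reference_matmul(A, B):
    """Untiled matrix product used to check the tiled version."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]
```

Calling `tiled_matmul` with different tile sizes on the same inputs returns the same matrix as `reference_matmul`, which is the sense in which the schedule can change without touching the data-flow.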

Paper Structure

This paper contains 30 sections, 18 figures, 4 tables.

Figures (18)

  • Figure 1: An example TileLang program and the corresponding lowered IR and generated CUDA C code. The code snippets are simplified for demonstration purposes.
  • Figure 2: Stages of TileLang Compile Pipeline.
  • Figure 3: Optimizing GEMM with Multi-Level Tiling on GPUs via TileLang.
  • Figure 4: Interface of a Tile-Operator, and example instances of TileOP.
  • Figure 5: Interface and example instances of Layout Function.
  • ...and 13 more figures