Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis

Xiao Zhang, Yaoyao Ding, Yang Hu, Gennady Pekhimenko

TL;DR

Hexcute presents a tile-based GPU programming language that exposes shared memory and register abstractions to enable fine-grained optimizations for mixed-type DL operators. A novel type-inference-based algorithm synthesizes task mappings and layouts by treating thread-value layouts as part of tensor types, formulating layout constraints, and propagating them to generate efficient low-level code. The system combines high expressiveness with automation, delivering 1.7–11.28$\times$ speedups over state-of-the-art compilers for mixed-type operators and up to 2.91$\times$ end-to-end improvements in vLLM deployments. Across GEMM, MHA, MoE, and end-to-end DL workloads, Hexcute demonstrates broad generality, scalable automation, and practical impact for accelerating DL inference and training on modern GPUs.

Abstract

Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations, such as fine-grained data pipelines and hardware-friendly memory layouts, for these operators, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming effort. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Additionally, Hexcute leverages task mapping to schedule the GPU program and, to reduce programming effort, automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7–11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to 2.91$\times$ speedup in the end-to-end evaluation.

Paper Structure

This paper contains 27 sections, 1 theorem, 11 equations, 24 figures, 1 table, and 1 algorithm.

Key Result

Theorem 1

Consider a matrix-multiplication operation where the tensors $A$, $B$, and $C$ have the thread-value layouts $f_A$, $f_B$, and $f_C$, respectively. Suppose we use a Tensor Core instruction $I$, represented by the thread-value layouts $p_A$, $p_B$, and $p_C$ for the $A$, $B$, and $C$ operands, to execute this operation. Let the projection be defined as follows: …, and the natural embedding functions are defined: … where $m_T$, …

Figures (24)

  • Figure 1: An FP16$\times$INT4 matmul written with Triton, where activation matrix A is of the float16 data type and weight matrix B consists of 4-bit integers.
  • Figure 2: The dataflow of mixed-type matmul kernels.
  • Figure 3: The INT4 packed layout designed in TensorRT-LLM ensures layout conformance when converting INT4 weights to FP16 weights.
  • Figure 4: A thread block of 128 threads copies a 64$\times$8 tile from shared memory to register files. Organized as a 16$\times$8 grid, the threads copy the data in four iterations. Figure (a) shows the cooperative_load function implemented with task mapping constructs, while Figure (b) illustrates how task mapping defines the collective data movement.
  • Figure 5: Example: row-major interleaved layout A. Figure (a) visualizes the function defined by A, with the integers in the box indicating the function's outputs.
  • ...and 19 more figures
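
The collective copy described in the Figure 4 caption can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the paper's cooperative_load implementation or any Hidet/Hexcute API: the function names are hypothetical, and the sketch assumes that threads are numbered row-major within the 16$\times$8 grid and that each iteration advances by 16 rows of the 64$\times$8 tile.

```python
# Illustrative sketch only: names and the row-major thread ordering are assumptions,
# not taken from the paper.
TILE_ROWS, TILE_COLS = 64, 8        # tile copied from shared memory to registers
GRID_ROWS, GRID_COLS = 16, 8        # 128 threads arranged as a 16x8 grid
NUM_THREADS = GRID_ROWS * GRID_COLS
NUM_ITERS = TILE_ROWS // GRID_ROWS  # 4 iterations cover all 64 rows


def tasks_for_thread(tid):
    """Return the (row, col) tile elements thread `tid` copies, one per iteration."""
    grid_row, grid_col = divmod(tid, GRID_COLS)  # assumed row-major thread order
    return [(it * GRID_ROWS + grid_row, grid_col) for it in range(NUM_ITERS)]


def covers_tile_exactly_once():
    """Check that the 128 threads together touch every tile element exactly once."""
    seen = set()
    for tid in range(NUM_THREADS):
        for elem in tasks_for_thread(tid):
            if elem in seen:
                return False
            seen.add(elem)
    return len(seen) == TILE_ROWS * TILE_COLS


if __name__ == "__main__":
    print(tasks_for_thread(0))    # [(0, 0), (16, 0), (32, 0), (48, 0)]
    print(tasks_for_thread(127))  # [(15, 7), (31, 7), (47, 7), (63, 7)]
    print(covers_tile_exactly_once())  # True
```

Under these assumptions, each thread handles four elements in the same column, spaced 16 rows apart, and the 128 threads jointly cover all 512 elements of the tile without overlap.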

Theorems & Definitions (1)

  • Theorem 1