Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
Xiao Zhang, Yaoyao Ding, Yang Hu, Gennady Pekhimenko
TL;DR
Hexcute is a tile-based GPU programming language that exposes shared memory and register abstractions to enable fine-grained optimizations for mixed-type DL operators. A novel type-inference-based algorithm synthesizes task mappings and layouts by treating thread-value layouts as part of tensor types, formulating layout constraints, and propagating them to generate efficient low-level code. The system combines high expressiveness with automation, delivering 1.7-11.28$\times$ speedups over state-of-the-art compilers for mixed-type operators and up to 2.91$\times$ end-to-end improvements in vLLM deployments. Across GEMM, MHA, MoE, and end-to-end DL workloads, Hexcute demonstrates broad generality, scalable automation, and practical impact for accelerating DL inference and training on modern GPUs.
Abstract
Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations like fine-grained data pipelines and hardware-friendly memory layouts for these operators, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming effort. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Additionally, Hexcute leverages task mapping to schedule the GPU program, and to reduce programming effort, it automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7-11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to 2.91$\times$ speedup in the end-to-end evaluation.
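To make the phrase "matrix multiplication operator with mixed input data types" concrete, the following is a minimal plain-Python reference sketch of one common case: floating-point activations multiplied by 4-bit integer weights that are packed two per byte and dequantized with per-column scales. It is not Hexcute code and does not reflect Hexcute's API. The function names (`pack_int4`, `unpack_int4`, `mixed_type_matmul`), the low-nibble-first packing order, and the single shared zero point are all assumptions chosen for illustration.

```python
# Illustrative reference only: a mixed-type matmul in plain Python.
# All names and the packing scheme are hypothetical, not Hexcute's API.

def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    assert len(values) % 2 == 0, "need an even number of 4-bit values"
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def unpack_int4(packed):
    """Inverse of pack_int4: recover the list of 4-bit integers."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)   # low nibble was stored first
        out.append(byte >> 4)    # then the high nibble
    return out


def mixed_type_matmul(a, w_packed, scales, zero_point, k, n):
    """Compute a @ dequant(w).

    a          : M x K list of floats (activations)
    w_packed   : K*N 4-bit weights, row-major, packed by pack_int4
    scales     : length-N list of per-column float scales
    zero_point : integer subtracted from every weight before scaling
    """
    w_int = unpack_int4(w_packed)
    # Dequantize the integer weights to floats, one K x N matrix.
    w = [
        [(w_int[row * n + col] - zero_point) * scales[col] for col in range(n)]
        for row in range(k)
    ]
    # Ordinary matrix multiplication on the dequantized weights.
    return [
        [sum(a_row[i] * w[i][col] for i in range(k)) for col in range(n)]
        for a_row in a
    ]


if __name__ == "__main__":
    packed = pack_int4([8, 9, 10, 7])  # 2 x 2 weight matrix, row-major
    print(mixed_type_matmul([[1.0, 2.0]], packed, [0.5, 2.0], 8, k=2, n=2))
```

A GPU kernel would typically perform the unpacking and dequantization on tiles held in registers or shared memory and fuse them with the multiply, which is where the data pipelines and memory layouts mentioned in the abstract become relevant. This sketch only fixes the arithmetic that such a kernel must reproduce.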
