Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$
Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal
TL;DR
This work tackles the fragmentation and brittleness of tensor layouts in deep learning accelerators by introducing Linear Layouts, an $\mathbb{F}_2$-based framework that expresses tensor layouts as linear maps between labeled binary vector spaces. By supporting composition, left-division, and inversion, the approach provides a generic, robust mechanism for defining layouts and converting between them, and it integrates with Triton to propagate layouts through shape and memory operations. The authors demonstrate automatic swizzling, warp-shuffle optimization, and generic lowering of hardware intrinsics within Triton, yielding speedups of up to $1.40\times$ (average $1.07\times$) across 265 benchmarks and fixing several layout-related bugs in Triton. The work offers a principled foundation for layout optimization that reduces engineering effort, improves correctness, and enables hardware-aware code generation across platforms.
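To make the idea of a layout as a linear map over $\mathbb{F}_2$ concrete, here is a minimal sketch, not taken from the paper or from Triton. It assumes a layout is stored as a list of basis images (one bitmask per input bit, i.e. the columns of the binary matrix), and that applying the layout means XOR-ing the images of the set bits of the input index. The names `apply_layout` and `xor_swizzle` are hypothetical.

```python
def apply_layout(basis, index):
    """Apply an F_2-linear map given by its basis images.

    basis[k] is the image (as a bitmask) of input bit k, i.e. column k of
    the binary matrix. Addition over F_2 is XOR.
    """
    out = 0
    bit = 0
    while index:
        if index & 1:
            out ^= basis[bit]
        index >>= 1
        bit += 1
    return out


# Hypothetical 3-bit layout: input bit 1 is sent to 0b011, which mixes
# two output bits (a toy XOR swizzle). Bits 0 and 2 map to themselves.
xor_swizzle = [0b001, 0b011, 0b100]

mapped = [apply_layout(xor_swizzle, i) for i in range(8)]
print(mapped)  # [0, 1, 3, 2, 4, 5, 7, 6]
```

Because the map is linear, its behavior on all eight indices is fully determined by the three basis images, which is what lets a single generic routine stand in for per-layout special cases.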
Abstract
Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to achieve flexible and performant design and implementation of tensor layouts -- mappings between logical tensors and hardware resources. The increasing complexity of DL algorithms and hardware demands a generic and systematic approach to handling tensor layouts. In this work, we introduce Linear Layouts, a novel approach that models tensor layouts using linear algebra over $\mathbb{F}_2$. By representing tensor layouts as binary matrices acting on the bits of the hardware representation, our approach enables a generic layout definition -- as opposed to the classical case-by-case approach -- and allows for generic layout-to-layout conversions, eliminating the quadratic explosion that plagues existing solutions. We integrate linear layouts with Triton and demonstrate their effectiveness in optimizing individual Triton operators as well as kernels written in Triton. We also show that linear layouts reduce engineering effort in the compiler backend while fixing several bugs in Triton's legacy layout system.
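The claim about generic layout-to-layout conversion can be illustrated with a small sketch. It assumes, for illustration only, that both layouts are invertible square maps from hardware index bits to tensor index bits, so that the conversion from layout $A$ to layout $B$ is $B^{-1} \circ A$. The helper names (`apply_layout`, `compose`, `invert`) and the two toy layouts are hypothetical and are not Triton's API.

```python
def apply_layout(basis, index):
    """XOR together the basis images of every set bit of index."""
    out, bit = 0, 0
    while index:
        if index & 1:
            out ^= basis[bit]
        index >>= 1
        bit += 1
    return out


def compose(f, g):
    """Return the basis of f after g: column k is f applied to g's column k."""
    return [apply_layout(f, col) for col in g]


def invert(cols):
    """Invert a square F_2 matrix by Gauss-Jordan elimination on columns.

    Each pair holds (image, preimage) with apply_layout(cols, preimage) == image;
    XOR-ing pairs preserves that invariant until every image is a unit vector.
    """
    n = len(cols)
    pairs = [[c, 1 << k] for k, c in enumerate(cols)]
    for bit in range(n):
        pivot = next((i for i in range(bit, n) if (pairs[i][0] >> bit) & 1), None)
        if pivot is None:
            raise ValueError("layout is not invertible over F_2")
        pairs[bit], pairs[pivot] = pairs[pivot], pairs[bit]
        for i in range(n):
            if i != bit and (pairs[i][0] >> bit) & 1:
                pairs[i][0] ^= pairs[bit][0]
                pairs[i][1] ^= pairs[bit][1]
    return [pre for _, pre in pairs]


# Two hypothetical 3-bit layouts (hardware index bits -> tensor index bits).
layout_a = [0b010, 0b001, 0b100]  # swaps the two low bits
layout_b = [0b001, 0b011, 0b100]  # toy XOR swizzle

# For a hardware slot h under A, convert gives the slot under B that
# holds the same tensor element.
convert = compose(invert(layout_b), layout_a)

for h in range(8):
    assert apply_layout(layout_b, apply_layout(convert, h)) == apply_layout(layout_a, h)
```

The same three helpers handle any pair of invertible layouts, so no conversion routine has to be written per pair; this is the sense in which the quadratic number of special cases disappears. Non-square or non-invertible layouts would need the left-division operation mentioned in the summary, which this sketch does not cover.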
