Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language
Yihong Zhang, Derek Gerstmann, Andrew Adams, Maaz Bin Safeer Ahmad
TL;DR
Tensor accelerators are underused because they are hard to program, so most developers reach them only through vendor kernel libraries built around MatMul. This paper introduces HardBoiled, a tensor instruction selector for Halide based on equality saturation. Users write their algorithms in Halide, and rewrite rules map them onto CPU- and GPU-attached accelerators (AMX and Nvidia Tensor Cores), preserving Halide's separation of algorithm and schedule and working with existing scheduling operations such as producer-consumer fusion. The authors implement several image processing pipelines (filtering, resampling, and denoising) in a few dozen lines each and report significant speedups over baselines that do not use the accelerator, including a 6.1x speedup for a downsampling routine using Tensor Cores on an Nvidia RTX 4070 GPU. The result extends tensor accelerators beyond traditional ML and scientific computing workloads to image processing pipelines that are linear transformations in disguise.
Abstract
Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, leading developers to use vendor-provided kernel libraries that support tensor accelerators. As a result, the usage of tensor accelerators is limited to the provided interface, mainly designed for traditional ML and scientific computing workloads. In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore utilize such specialized hardware. This is nonetheless hindered by the difficulties in programming tensor accelerators. We tackle this problem with compiler-based techniques. We use the Halide user-schedulable language and express operations as Halide algorithms succinctly. To this end, we implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and works with existing scheduling operations (e.g., producer-consumer fusion). Together, this enables developers to write diverse accelerator-leveraging applications in a few dozen lines. Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines. We show that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by $6.1\times$ by utilizing Tensor Cores on an Nvidia RTX 4070 GPU.
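The claim that image processing pipelines are "linear transformations over matrices in disguise" can be illustrated with a minimal sketch. The pure-Python example below is not taken from the paper and does not use Halide or HardBoiled; the function names, the `[1/4, 1/2, 1/4]` filter taps, and the 8x8 test image are all assumptions chosen for illustration. It shows that a strided filter (a simple downsampling step) applied to every column of an image gives the same result as one matrix product with a banded matrix, which is the shape of computation a tensor accelerator is built for.

```python
from fractions import Fraction

def downsample_stencil(signal, taps, stride):
    """Direct form: slide the filter taps over the signal with the given stride."""
    n_out = (len(signal) - len(taps)) // stride + 1
    return [
        sum(taps[k] * signal[i * stride + k] for k in range(len(taps)))
        for i in range(n_out)
    ]

def downsample_matrix(n_in, taps, stride):
    """Build the banded matrix A such that A @ signal equals the stencil result."""
    n_out = (n_in - len(taps)) // stride + 1
    A = [[Fraction(0)] * n_in for _ in range(n_out)]
    for i in range(n_out):
        for k, t in enumerate(taps):
            A[i][i * stride + k] = t
    return A

def matmul(A, B):
    """Plain triple-loop matrix product, the operation a tensor unit accelerates."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

# Hypothetical data: an 8x8 image with exact rational pixel values.
taps = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
image = [[Fraction((3 * r + 5 * c) % 7) for c in range(8)] for r in range(8)]

# Stencil form: downsample each column separately, then reassemble rows.
columns = [[image[r][c] for r in range(8)] for c in range(8)]
stencil_cols = [downsample_stencil(col, taps, 2) for col in columns]
stencil_out = [
    [stencil_cols[c][r] for c in range(8)] for r in range(len(stencil_cols[0]))
]

# Matrix form: a single MatMul downsamples all columns at once.
A = downsample_matrix(8, taps, 2)
matmul_out = matmul(A, image)

assert stencil_out == matmul_out
```

Exact `Fraction` arithmetic is used so the two forms can be compared with `==` rather than a floating-point tolerance. Most entries of the banded matrix are zero, so the matrix form does more arithmetic than the stencil; it pays off only when the hardware executes dense matrix products much faster than scalar or vector code.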
