Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Nir Ailon, Akhiad Bercovich, Yahel Uffenheimer, Omri Weinstein
TL;DR
The paper tackles the MatMul bottleneck in deep networks by introducing Strassen-Tile (STL), a GPU-native bilinear operator that acts on tiles of the weight and activation matrices through learnable encoders and a learnable decoder. STL substantially reduces FLOPs while increasing parameter count relative to MatMul. The authors show that it can approximate 4×4 MatMul of tiles with 2.66× fewer FLOPs and can improve ImageNet-1K accuracy of T2T-ViT-7 (4.3M parameters) while lowering FLOPs, with wall-clock speedups on GPUs in the compute-bound regime. The work grounds STL in Strassen normal forms, analyzes its FLOPs/IO complexity, and details a GPU-friendly implementation that decomposes the computation into MatMuls over the encoded tiles. Empirically, STL shows promise in under-parameterized regimes, with initialization and training strategy playing crucial roles, and the paper points to larger architectures and specialized kernels as future work. Overall, STL is presented as a promising building block for scalable and cost-efficient AI that trades off speed, accuracy, and parameter count.
Abstract
Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose a GPU-native bilinear operator as an alternative to MatMuls in neural networks, which offers a three-way tradeoff among speed, accuracy, and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied to tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random initialization when training with SGD. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4×4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve ImageNet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA-optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounding, suggest STL as a promising building block for scalable and cost-efficient AI.
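To make the operator concrete, the following is a minimal NumPy sketch of one possible reading of the description above; it is not the authors' implementation. It assumes t×t tiles flattened row-major, encoders `E_X` and `E_W` of shape (r, t²), and a decoder `D` of shape (t², r). The names `to_tiles`, `from_tiles`, `stl`, and `flop_reduction` are hypothetical. For illustration, the encoders and decoder are fixed to Strassen's classical 2×2 algorithm (t = 2, r = 7), for which the operator reproduces MatMul exactly; in STL these matrices would instead be learned.

```python
import numpy as np

def to_tiles(M, t):
    """Split an (R, C) matrix into (R//t, C//t, t*t) row-major flattened tiles."""
    R, C = M.shape
    return M.reshape(R // t, t, C // t, t).transpose(0, 2, 1, 3).reshape(R // t, C // t, t * t)

def from_tiles(T, t):
    """Inverse of to_tiles: reassemble flattened tiles into a full matrix."""
    a, b, _ = T.shape
    return T.reshape(a, b, t, t).transpose(0, 2, 1, 3).reshape(a * t, b * t)

def stl(X, W, E_X, E_W, D, t):
    """Encode tiles, multiply element-wise (summed over the inner tile index), decode."""
    Xh = to_tiles(X, t) @ E_X.T              # (n/t, k/t, r): change of basis on X tiles
    Wh = to_tiles(W, t) @ E_W.T              # (k/t, m/t, r): change of basis on W tiles
    # For each of the r channels c this is an ordinary MatMul Xh[:, :, c] @ Wh[:, :, c].
    P = np.einsum("ilc,ljc->ijc", Xh, Wh)    # (n/t, m/t, r)
    return from_tiles(P @ D.T, t)            # decode each output tile back to t*t entries

def flop_reduction(t, r):
    """Leading-order FLOP ratio MatMul / STL, ignoring encode and decode cost."""
    return t ** 3 / r

# Strassen's 2x2 algorithm written as encoders/decoder (t = 2, r = 7).
# Tile entries are ordered [m11, m12, m21, m22].
E_X_STRASSEN = np.array([[ 1, 0, 0,  1],
                         [ 0, 0, 1,  1],
                         [ 1, 0, 0,  0],
                         [ 0, 0, 0,  1],
                         [ 1, 1, 0,  0],
                         [-1, 0, 1,  0],
                         [ 0, 1, 0, -1]], dtype=float)
E_W_STRASSEN = np.array([[ 1, 0, 0,  1],
                         [ 1, 0, 0,  0],
                         [ 0, 1, 0, -1],
                         [-1, 0, 1,  0],
                         [ 0, 0, 0,  1],
                         [ 1, 1, 0,  0],
                         [ 0, 0, 1,  1]], dtype=float)
D_STRASSEN = np.array([[1,  0, 0, 1, -1, 0, 1],
                       [0,  0, 1, 0,  1, 0, 0],
                       [0,  1, 0, 1,  0, 0, 0],
                       [1, -1, 1, 0,  0, 1, 0]], dtype=float)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 4))
    W = rng.standard_normal((4, 8))
    Y = stl(X, W, E_X_STRASSEN, E_W_STRASSEN, D_STRASSEN, t=2)
    print("max abs error vs. X @ W:", np.abs(Y - X @ W).max())
    print("FLOP reduction for t=2, r=7:", flop_reduction(2, 7))
```

Under these assumptions the dominant cost is r MatMuls of size (n/t)×(k/t)×(k/t → m/t), so the leading-order saving over a dense MatMul is t³/r. For 4×4 tiles this is 64/r; a value of r = 24 would give roughly 2.67, close to the factor of 2.66 quoted in the abstract, but that value of r is an inference from the arithmetic and is not stated in the text above.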
