QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities

Sai Kiran Narayanaswami, Gopalakrishnan Srinivasan, Balaraman Ravindran

TL;DR

QuAKE reduces inference latency by accelerating exponential non-linearities such as Softmax and GELU with a family of approximate exponential operators that exploit IEEE-754 bit representations and require no special hardware. The core idea is to approximate the exponential with an affine transform of the input, parameterized by $c_0$ and $c_1$, and to refine accuracy with a second-order variant (QuAKE2) that preserves continuity and reduces error to roughly 0.34%. Empirically, QuAKE yields inference speedups of roughly 10–35% on server CPUs and 5–45% on embedded and mobile CPUs across a broad set of models and tasks, with downstream task performance remaining largely intact; QuAKE2 often matches or improves accuracy while maintaining the speedups. The work applies to both CNNs and Transformers, and it highlights the practical potential for real-time and edge AI deployments by lowering the computational cost of these non-linearities.

Abstract

As machine learning is deployed ever more widely and model sizes continue to grow, improving computational efficiency during model inference has become a key challenge. In many commonly used model architectures, including Transformers, exponential non-linearities such as Softmax account for a significant portion of the inference computation. In this work, we develop QuAKE, a collection of novel operators that leverage certain properties of IEEE-754 floating point representations to quickly approximate the exponential function without requiring specialized hardware, extra memory, or precomputation. We propose optimizations that enhance the efficiency of QuAKE in commonly used exponential non-linearities such as Softmax, GELU, and the Logistic function. Our benchmarks demonstrate substantial inference speed improvements of 10% to 35% on server CPUs and 5% to 45% on embedded and mobile-scale CPUs for a variety of model architectures and sizes. Evaluations of model performance on standard datasets and tasks from various domains show that QuAKE operators provide sizable speed benefits with little to no loss of performance on downstream tasks.
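
To make the IEEE-754 idea concrete, here is a minimal sketch of the well-known Schraudolph-style bit trick that this family of approximations builds on. It is not the authors' QuAKE operator: the function names `approx_exp` and `approx_softmax`, the specific values of `C0` and `C1`, and the clipping threshold are all assumptions chosen for illustration. The sketch computes an affine function of the input, truncates it to a 32-bit integer, and reinterprets those bits as a float32, so that the integer part of $x / \ln 2$ lands in the exponent field and the fractional part in the mantissa.

```python
import numpy as np

# Assumed constants for a float32 Schraudolph-style approximation
# (illustrative values, not taken from the paper).
C1 = np.float32(2**23 / np.log(2.0))  # maps x to x / ln(2), shifted past the 23 mantissa bits
C0 = np.float32(127 * 2**23)          # IEEE-754 float32 exponent bias, placed in the exponent field


def approx_exp(x, c0=C0, c1=C1):
    """Approximate exp(x) by reinterpreting int32(c1 * x + c0) as float32.

    Only meaningful for roughly -87 < x < 88, where the integer stays
    positive and inside the float32 exponent range.
    """
    x = np.asarray(x, dtype=np.float32)
    bits = (c1 * x + c0).astype(np.int32)  # affine transform, then truncate to an integer
    return bits.view(np.float32)           # reinterpret the same 32 bits as a float


def approx_softmax(x, axis=-1):
    """Softmax built on approx_exp (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=np.float32)
    z = x - x.max(axis=axis, keepdims=True)  # largest entry becomes 0, all others negative
    z = np.maximum(z, np.float32(-80.0))     # assumed clip to keep the bit trick in range
    e = approx_exp(z)
    return e / e.sum(axis=axis, keepdims=True)


if __name__ == "__main__":
    xs = np.linspace(-10.0, 10.0, 2001, dtype=np.float32)
    rel_err = np.abs(approx_exp(xs) - np.exp(xs)) / np.exp(xs)
    print("max relative error:", rel_err.max())
    print(approx_softmax(np.array([1.0, 2.0, 3.0])))
```

With these unshifted constants the approximation interpolates linearly between powers of two, so it never undershoots by more than rounding error and overshoots by at most about 6%. Shifting $c_0$ trades some of that one-sided error for a smaller two-sided error, which is the kind of tuning the parameterization by $c_0$ and $c_1$ allows.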

Paper Structure

This paper contains 23 sections, 7 equations, 7 tables, 2 algorithms.