PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer

Pierre-David Letourneau; Manish Kumar Singh; Hsin-Pai Cheng; Shizhong Han; Yunxiao Shi; Dalton Jones; Matthew Harper Langston; Hong Cai; Fatih Porikli

TL;DR

The paper addresses the quadratic complexity of standard self-attention in Vision Transformers by introducing PADRe, a unified Polynomial Attention Drop-in Replacement that combines polynomial approximants with hardware-friendly Hadamard-product nonlinearities to achieve time and memory costs that are linear in the number of input tokens. PADRe is a general, scalable framework: many existing efficient attention mechanisms (e.g., Hyena, Mamba, SimA, Conv2Former, Castling-ViT) are shown to be instances or approximations of it, and its output is expressed formally as a degree-$d$ polynomial in the input. A specific PADRe implementation achieves comparable or improved accuracy on image classification, 2D object detection, and 3D point-cloud detection while delivering 11x to 43x speedups over self-attention on server GPUs and mobile NPUs, including quantized int8 inference. The findings suggest PADRe as a practical drop-in replacement for self-attention that enables efficient large-scale Vision Transformer models on resource-constrained hardware, with clear pathways to extensions such as cross-attention and rational variants.

Abstract

We present Polynomial Attention Drop-in Replacement (PADRe), a novel and unifying framework designed to replace the conventional self-attention mechanism in transformer models. Notably, several recent alternative attention mechanisms, including Hyena, Mamba, SimA, Conv2Former, and Castling-ViT, can be viewed as specific instances of our PADRe framework. PADRe leverages polynomial functions and draws upon established results from approximation theory, enhancing computational efficiency without compromising accuracy. PADRe's key components include multiplicative nonlinearities, which we implement using straightforward, hardware-friendly operations such as Hadamard products, incurring only linear computational and memory costs. PADRe also avoids complex functions such as Softmax, yet achieves accuracy comparable or superior to that of traditional self-attention. We assess the effectiveness of PADRe as a drop-in replacement for self-attention across diverse computer vision tasks: image classification, image-based 2D object detection, and 3D point cloud object detection. Empirical results demonstrate that PADRe runs significantly faster than conventional self-attention (11x to 43x faster on server GPU and mobile NPU) while maintaining similar accuracy when substituted for self-attention in transformer models.
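To make the cost argument concrete, the following is a minimal NumPy sketch, not the paper's module. It contrasts a degree-2 multiplicative nonlinearity built from a Hadamard product with single-head softmax self-attention. The function names `hadamard_mix` and `softmax_attention`, the weight matrices, and the sizes `N` and `D` are all assumptions made for illustration, and the sketch mixes channels only, omitting any mixing across tokens.

```python
import numpy as np

def hadamard_mix(X, W1, W2):
    """Illustrative degree-2 nonlinearity: two linear maps of X, multiplied element-wise."""
    # The largest intermediate has shape (N, D), so memory grows linearly with N.
    return (X @ W1) * (X @ W2)

def softmax_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention, included only for comparison."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # The score matrix has shape (N, N), so time and memory grow quadratically with N.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (scores / scores.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
N, D = 64, 16  # hypothetical token count and channel width
X = rng.standard_normal((N, D))
W = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

out_poly = hadamard_mix(X, W[0], W[1])            # shape (N, D)
out_attn = softmax_attention(X, W[0], W[1], W[2])  # shape (N, D)
```

Both functions return an `(N, D)` array, which is what allows one to stand in for the other. The Hadamard variant never materializes an `N x N` matrix and needs no exponential or normalization.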

Paper Structure (26 sections, 2 theorems, 49 equations, 2 figures, 6 tables)

Introduction
Related Work
PADRe Framework Approach
Linear Transformations
Nonlinearities
Optional Operations
Overall Framework
Unifying Framework
Computational Characteristics
Experiments
Implementation of PADRe
Performance Evaluation on Vision Applications
Latency Evaluation on Hardware Platforms
Ablation Study on Polynomial Degree in PADRe
Discussion
...and 11 more sections

Key Result

Lemma 1

The elements of $Z_i$ are homogeneous polynomials of degree $i$ in the input $X$.
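Homogeneity of degree $i$ means that scaling the input by a constant $c$ scales $Z_i$ by $c^i$. The sketch below checks this numerically under an assumed recurrence, since this summary does not define $Z_i$: it takes $Y_i = A_i X B_i$, $Z_1 = Y_1$, and $Z_{i+1} = Z_i \odot Y_{i+1}$ with no bias terms. The function name `polynomial_terms`, the dense random matrices, and the sizes are hypothetical. Dense `N x N` token-mixing matrices are used only for brevity and would not give linear cost.

```python
import numpy as np

def polynomial_terms(X, A, B):
    """Return [Z_1, ..., Z_d] under the assumed recurrence Z_{i+1} = Z_i * Y_{i+1}."""
    # Each Y_i is linear in X (no bias), mixing tokens via A_i and channels via B_i.
    Y = [a @ X @ b for a, b in zip(A, B)]
    Z = [Y[0]]
    for y in Y[1:]:
        # Each Hadamard product with a linear term raises the polynomial degree by one.
        Z.append(Z[-1] * y)
    return Z

rng = np.random.default_rng(0)
N, D, degree = 8, 4, 3  # hypothetical sizes
A = [rng.standard_normal((N, N)) for _ in range(degree)]
B = [rng.standard_normal((D, D)) for _ in range(degree)]
X = rng.standard_normal((N, D))

c = 2.0
Z = polynomial_terms(X, A, B)
Z_scaled = polynomial_terms(c * X, A, B)

# Z_i(c X) should equal c**i * Z_i(X); the list index starts at 0, hence i + 1.
homogeneous = [
    np.allclose(zs, c ** (i + 1) * z)
    for i, (z, zs) in enumerate(zip(Z, Z_scaled))
]
```

Adding a bias to any `Y_i` would break this property, because the terms would then mix several polynomial degrees.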

Figures (2)

Figure 1: Our implementation of a PADRe-based attention drop-in replacement module. In our experiments, we use this to substitute the standard self-attention parts in existing transformer models. Note that this implementation approach is just a specific instance of our general PADRe framework.
Figure 2: On-device (GPU and NPU) latency comparison for PADRe vs. standard self-attention. The latency of self-attention escalates significantly as the number of input tokens increases, whereas the latency of PADRe grows linearly.

Theorems & Definitions (5)

Lemma 1
proof
Definition B.1
Lemma 2
proof
