The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers under Fully Homomorphic Encryption on the Torus

Rickard Brännvall, Andrei Stoian

TL;DR

The paper introduces Inhibitor attention, an addition- and ReLU-based mechanism that replaces dot-product attention and Softmax in Transformer blocks to enable efficient quantized inference and privacy-preserving computation under Fully Homomorphic Encryption (FHE). By computing a Manhattan-distance-based score $Z_{ij}$ and applying a ReLU-based inhibition $H_{ik}$, the approach eliminates many of the variable-by-variable multiplications and nonlinearities that are costly in FHE. Empirical results on four tasks show performance comparable to standard attention, and scaling experiments show plaintext speedups of roughly 30–50% and speedups under encryption of 3–6x. The work also provides practical implementation guidance, including signed variants, memory-efficient formulations, and parameter considerations for TFHE with programmable bootstrapping (PBS), highlighting the potential for scalable privacy-preserving AI on constrained hardware and encrypted data.
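
To make the two steps concrete, here is a minimal integer-only sketch in plain Python. This summary does not state the exact formulas, so the forms used here are assumptions: a score $Z_{ij} = \sum_k |Q_{ik} - K_{jk}|$ (optionally scaled down by a right shift) and an output $H_{ik} = \sum_j \mathrm{ReLU}(V_{jk} - Z_{ij})$. The function names and the `shift` parameter are hypothetical and chosen only for illustration.

```python
def manhattan_scores(Q, K, shift=0):
    """Assumed score: Z[i][j] = sum_k |Q[i][k] - K[j][k]|, scaled by >> shift.

    `shift` is a hypothetical power-of-two scaling, not a parameter
    taken from the paper.
    """
    return [
        [sum(abs(q - k) for q, k in zip(q_row, k_row)) >> shift for k_row in K]
        for q_row in Q
    ]


def inhibit(V, Z):
    """Assumed inhibition: H[i][k] = sum_j max(V[j][k] - Z[i][j], 0)."""
    n_keys, dim = len(V), len(V[0])
    return [
        [sum(max(V[j][k] - z_row[j], 0) for j in range(n_keys)) for k in range(dim)]
        for z_row in Z
    ]


def inhibitor_attention(Q, K, V, shift=0):
    """Compose both steps; only add, subtract, abs, max and shift are used."""
    return inhibit(V, manhattan_scores(Q, K, shift))


# Tiny example with two queries, two keys and two-dimensional values.
Q = [[1, 2], [3, 0]]
K = [[1, 2], [0, 0]]
V = [[5, 1], [4, 6]]

print(manhattan_scores(Q, K))        # [[0, 3], [4, 3]]
print(inhibitor_attention(Q, K, V))  # [[6, 4], [2, 3]]
```

In this sketch a key that is far from the query (large $Z_{ij}$) subtracts more from the corresponding value before the ReLU, so distant keys contribute little or nothing. No product of two variables appears anywhere, which is the property the TL;DR highlights for FHE.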

Abstract

To enhance the computational efficiency of quantized Transformers, we replace the dot-product and Softmax-based attention with an alternative mechanism involving addition and ReLU activation only. This side-steps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations but maintains much of the core functionality of conventional dot-product attention. It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems like homomorphic encryption. Training experiments on four common benchmark tasks show test set prediction scores comparable to those of conventional Transformers with dot-product attention. Our scaling experiments also suggest significant computational savings, both in plaintext and under encryption. In particular, we believe that the ReLU and addition-based attention mechanism examined in this paper may enable privacy-preserving AI applications operating under homomorphic encryption by avoiding the costly multiplication of encrypted variables.
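
The remark about expanding precision can be illustrated with a worst-case bit-width count. The sketch below assumes that "double precision" refers to the doubling of integer bit width when two quantized values are multiplied; the helper names are hypothetical. It compares the accumulator width needed for a dot product with the width needed for a Manhattan distance over the same signed inputs.

```python
def signed_bits(value):
    """Smallest two's-complement width that can hold `value`."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


def worst_case_widths(bits, dim):
    """Return (dot-product width, Manhattan-distance width) for signed inputs."""
    lo = -(1 << (bits - 1))      # most negative representable value
    hi = (1 << (bits - 1)) - 1   # most positive representable value
    dot = dim * lo * lo          # largest-magnitude sum of products
    l1 = dim * (hi - lo)         # largest sum of absolute differences
    return signed_bits(dot), signed_bits(l1)


print(worst_case_widths(8, 1))   # (16, 9)
print(worst_case_widths(8, 64))  # (22, 15)
```

For 8-bit inputs a single product already needs 16 bits, while a single absolute difference needs 9; summing over the feature dimension adds the same number of extra bits to both. Under this reading, the addition-based score keeps intermediate values narrower, which matters when the arithmetic system has a tight budget of message bits.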

Paper Structure (11 sections, 11 equations, 4 tables)