Power-Softmax: Towards Secure LLM Inference over Encrypted Data

Itamar Zimerman, Allon Adir, Ehud Aharoni, Matan Avitan, Moran Baruch, Nir Drucker, Jenny Lerner, Ramy Masalha, Reut Meiri, Omri Soceanu

TL;DR

This work introduces the first polynomial LLMs with 32 layers and over a billion parameters, exceeding the size of previous models by more than tenfold, and demonstrates reasoning and in-context learning capabilities comparable to standard transformers of the same size, representing a breakthrough in the field.

Abstract

Modern cryptographic methods for implementing privacy-preserving LLMs, such as Homomorphic Encryption (HE), require the LLMs to have a polynomial form. Forming such a representation is challenging because Transformers include non-polynomial components, such as Softmax and layer normalization. Previous approaches have either directly approximated pre-trained models with large-degree polynomials, which are less efficient over HE, or replaced non-polynomial components with easier-to-approximate primitives before training, e.g., Softmax with pointwise attention. The latter approach might introduce scalability challenges. We present a new HE-friendly variant of self-attention that offers a stable form for training and is easy to approximate with polynomials for secure inference. Our work introduces the first polynomial LLMs with 32 layers and over a billion parameters, exceeding the size of previous models by more than tenfold. The resulting models demonstrate reasoning and in-context learning (ICL) capabilities comparable to standard transformers of the same size, representing a breakthrough in the field. Finally, we provide a detailed latency breakdown for each computation over encrypted data, paving the way for further optimization, and explore the differences in inductive bias between transformers relying on our HE-friendly variant and standard transformers. Our code is attached as a supplement.

Paper Structure

This paper contains 21 sections, 10 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of $\operatorname{Softmax}$ and $\operatorname{PowerSoftmax}$ normalization on normally distributed values on the left, uniformly distributed values in the middle, and evenly spaced values on the right. As can be seen, the empirical scaling trends are relatively similar.
  • Figure 2: Our Attention Variants: (Left) the $\operatorname{Softmax}$-based attention mechanism using the generalized attention formulation (Eq. \ref{eq:GenerelizedAttention}). (Middle) Our variant for training (purple) builds on the stable variant from Eq. \ref{eq:stableeVariant} and the Lipschitz division from Eq. \ref{eq:LipschitzPolyAttention}. (Right) During secure inference with the polynomial model (red), we use a length-agnostic approximation for division, as described in Eq. \ref{eq:LengthAgnosticAttention}.
  • Figure 3: Latency Over HE: Time in seconds for main transformer primitives (bars, total = 91%) accumulated across 32 layers. Each bar shows the latency breakdown of the underlying HE operations.
  • Figure 4: Training Curves: Comparison of test perplexity for transformers with $\operatorname{Softmax}$ and power normalization when trained on several datasets, including Pile, Wikitext-103, and Text-8.
  • Figure 5: Results on Vision Tasks. Training curves for ViT variants with $\operatorname{PowerSoftmax}$ (red) and the $\operatorname{Softmax}$ baseline (blue). Results are shown for Tiny-ImageNet on the left, CIFAR-100 in the middle, and CIFAR-10 on the right.
  • ...and 8 more figures
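
The comparison described in the Figure 1 caption can be illustrated with a small sketch. This listing does not define $\operatorname{PowerSoftmax}$, so the code below assumes a power-based normalization of the form $x_i^p / \sum_j x_j^p$ with a positive even exponent $p$. That form, the function names, the `eps` guard, and the sample sizes are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Standard softmax, shifted by the maximum for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def power_normalize(xs, p=2, eps=1e-9):
    """ASSUMED power-based normalization: x_i**p / (sum_j x_j**p + eps).

    p must be a positive even integer so that every term is non-negative.
    eps (hypothetical) guards against division by zero on an all-zero input.
    """
    if p <= 0 or p % 2 != 0:
        raise ValueError("p must be a positive even integer")
    powers = [x ** p for x in xs]
    total = sum(powers) + eps
    return [v / total for v in powers]


def sample_inputs(n=8, seed=0):
    """Three input types named in the Figure 1 caption (sizes are arbitrary)."""
    rng = random.Random(seed)
    return {
        "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
        "uniform": [rng.uniform(0.0, 1.0) for _ in range(n)],
        "evenly_spaced": [i / (n - 1) for i in range(n)],
    }


if __name__ == "__main__":
    for name, xs in sample_inputs().items():
        print(name)
        print("  softmax:", [round(v, 3) for v in softmax(xs)])
        print("  power  :", [round(v, 3) for v in power_normalize(xs)])
```

Under this assumed form, the normalization uses only powers, a sum, and one division, which is why it is a plausible candidate for polynomial approximation. Note that with an even exponent, a large negative score receives a large weight, unlike under Softmax.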