Efficient Decoding Methods for Language Models on Encrypted Data

Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg

TL;DR

This work tackles the challenge of privacy-preserving LLM decoding under homomorphic encryption by introducing polynomial, differentiable decoding primitives that avoid expensive encrypted comparisons. The core contributions are CutMax, a guaranteed-convergence HE-friendly argmax, and an HE-compatible nucleus (top-$p$) sampling method that leverages CutMax for efficient stochastic decoding. The authors provide theoretical convergence guarantees based on exponential amplification of the gap between the maximum and runner-up elements and demonstrate substantial latency reductions (24x–35x) with exact recovery on large vocabularies, alongside a differentiable framework for end-to-end gradient-based sequence optimization. Together, these advances enable practical, secure text generation on encrypted data and pave the way for scalable privacy-preserving LLM deployment, while highlighting the remaining cost components of encrypted inference and the need for auditing and bias mitigation in secure settings.

Abstract

Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving its convergence via exponential amplification of the gap ratio between the maximum and runner-up elements. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baselines, advancing secure text generation.

Paper Structure

This paper contains 31 sections, 7 theorems, 15 equations, 7 figures, 4 tables, 3 algorithms.

Key Result

Lemma 3.2

For $X \in \mathbb{R}^k$, let $G_i \sim \mathrm{Gumbel}(0,1)$ be independent for each $i$. Then the random variable $Y = \operatorname{argmax}_{i} \left( X_i + G_i \right)$ follows a categorical distribution with probabilities given by,
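
In its standard form, the Gumbel-Max trick assigns index $i$ the softmax probability $\exp(X_i) / \sum_j \exp(X_j)$. The following is a minimal plaintext sketch, not an encrypted implementation and not the authors' code, that checks this empirically. The function names (`softmax`, `gumbel_max_sample`, `empirical_frequencies`) and the example logits are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Softmax probabilities, shifted by the maximum for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_max_sample(logits, rng):
    """Return argmax_i (X_i + G_i) with independent G_i ~ Gumbel(0, 1)."""
    best_index, best_value = 0, -math.inf
    for i, x in enumerate(logits):
        u = rng.random()
        while u == 0.0:  # random() may return 0.0, and log(0) is undefined
            u = rng.random()
        g = -math.log(-math.log(u))  # inverse-CDF draw from Gumbel(0, 1)
        if x + g > best_value:
            best_index, best_value = i, x + g
    return best_index


def empirical_frequencies(logits, draws, seed=0):
    """Fraction of draws in which each index wins the perturbed argmax."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(draws):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [count / draws for count in counts]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.0]  # illustrative values, not from the paper
    print("softmax  :", softmax(example_logits))
    print("empirical:", empirical_frequencies(example_logits, draws=20000))
```

With enough draws the empirical frequencies should approach the softmax probabilities. The comparison inside `gumbel_max_sample` is exactly the kind of operation that is expensive under encryption, which is why an HE-friendly argmax is needed to use this trick on ciphertexts.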

Figures (7)

  • Figure 1: Scope: Secure LLM generation over the clients' encrypted data using HE (model is not necessarily encrypted). Here, standard decoding methods like $\operatorname{argmax}$ and $\operatorname{sampling}$ (red) for selecting the next token are not HE-friendly or considered inefficient. We introduce efficient and scalable HE-friendly methods to evaluate them under encryption.
  • Figure 2: Illustration of CutMax First Iteration on a Normal Distribution: The first iteration of the $\operatorname{CutMax}$ algorithm, applied to the original random variable $X \sim \mathrm{Normal}(170, 10)$ (a). We first standardize and shift $X$ using $Y=\frac{X-\mu_X}{c\sigma_X} + 1$ with $c=5$ (b), then raise $Y$ to the power $p=3$ (c) or $p=7$ (d) to induce right skew. The resulting distribution $Z=Y^7$ (d) has a maximum normalized height $\max_i \bigl(Z_i / \sum_j Z_j \bigr) = 0.0033$, indicating a strong right-tail skew. In each panel, a vertical black line marks the mean and gray lines mark one standard deviation from it.
  • Figure 3: Evolution of Logits Below the Mean Across CutMax Iterations (after the cut step of the CutMax algorithm). We ran CutMax for $T=3$ iterations with the best hyperparameters $(p,c)=(19,27)$, found via grid search, on 1,000 arXiv articles passed through GPT-2 (vocabulary size $|\mathcal{V}|=50{,}257$; Radford et al., 2019). Bars show the mean $\pm$ std of the fraction of entries with ${Y}^{(t)}_i < \mathbb E[{Y}^{(t)}]$ at each iteration $t$.
  • Figure 4: Violation Rates in Nucleus Sampling Methods: Violation rate (%) of sampling outside the top-$p$ nucleus for Gumbel-Max and Nucleus($\beta$-cut), averaged over 100 prompts with 1000 draws each ($p=0.9$). Error bars denote one standard deviation.
  • Figure 5: Comparison of Sampling Methods on GPT-2 Last-Token Decoding: The top strip shows the complete 50,257-dimensional logit vector (only its outline is visible at this scale), and the second strip zooms in on the five largest logits. The lower strips plot, for 1,000 repeated draws, histograms of the 30 most frequently chosen token indices produced by four different samplers: standard $\operatorname{Softmax}$ at temperature 1, the classical Gumbel-Max trick, inverse-transform sampling on the normalized exponentials (“Exp-inverse”), and our one-shot Beta-cut nucleus sampler, which perturbs each logit with i.i.d. $G_i \sim \mathrm{Beta}(\alpha,1)$ noise using $\alpha=0.486$, computed from the nucleus observation for $(p,q)=(0.9,0.95)$. Whereas the first three methods still allocate non-negligible mass to tail tokens, the Beta-cut histogram is sharply truncated, confirming that all probability remains inside the top-$p$ nucleus, as required by the HE-friendly nucleus sampling method.
  • ...and 2 more figures
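
The transformation described in the Figure 2 caption (standardize and shift, raise to a power, normalize) can be sketched in plaintext as follows. This is a rough illustration under stated assumptions, not the paper's CutMax algorithm: it uses the population standard deviation, it does not reproduce the cut step referenced in Figure 3, and chaining the step by feeding the normalized output back in is an assumption about how iterations compose. All function names and the example values are hypothetical.

```python
import statistics


def semi_standardize(values, c):
    """Y = (X - mean) / (c * std) + 1, with the population std (an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        raise ValueError("all entries are equal; the standard deviation is zero")
    return [(v - mean) / (c * std) + 1.0 for v in values]


def power_and_normalize(values, p):
    """Raise every entry to the integer power p, then divide by the sum."""
    powered = [v ** p for v in values]
    total = sum(powered)
    if total <= 0.0:
        raise ValueError("sum of powered entries is not positive")
    return [v / total for v in powered]


def cutmax_like_step(values, c, p):
    """One standardize-shift-power-normalize pass, as in the Figure 2 caption."""
    return power_and_normalize(semi_standardize(values, c), p)


def repeat_steps(values, c, p, steps):
    """Apply the pass several times (chaining is an assumption of this sketch)."""
    current = list(values)
    for _ in range(steps):
        current = cutmax_like_step(current, c, p)
    return current


if __name__ == "__main__":
    example = [1.0, 2.0, 3.0, 5.0]  # illustrative values, not from the paper
    for steps in (1, 2, 3):
        print(steps, repeat_steps(example, c=2, p=3, steps=steps))
```

Every operation here is an addition, a multiplication, or a power, apart from the division by the standard deviation and by the sum, which an encrypted implementation would have to approximate polynomially. For odd $p$ and a positive sum the pass is order-preserving, so the largest entry stays the largest while its normalized share grows; how close the output gets to one-hot depends on the vector length and on $(p, c)$, and this sketch makes no claim about the paper's convergence guarantee.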

Theorems & Definitions (19)

  • Definition 3.1
  • Lemma 3.2: Gumbel-Max Trick (Maddison et al., 2014)
  • Lemma 3.3
  • proof
  • Definition E.1: Max elements
  • Definition E.2: $\delta$-gap bound
  • Definition E.3: Gap Ratio
  • Definition E.4: $c$-semi-standardization
  • Definition E.5: $\operatorname{CutMax}$ Iteration
  • Lemma E.6
  • ...and 9 more