CryptoGen: Secure Transformer Generation with Encrypted KV-Cache Reuse

Hedong Zhang; Neusha Javidnia; Shweta Pardeshi; Qian Lou; Farinaz Koushanfar

TL;DR

CryptoGen tackles privacy-preserving autoregressive generation by combining fully homomorphic encryption with multi-party computation to protect both prompts and model parameters in a client-server setting. It introduces a unified framework with heterogeneous KV-cache encoding and autoregressive ciphertext-ciphertext (ARCC) kernels that enable stateful, token-by-token generation while reusing encrypted KV caches. Key contributions include a dual encoding strategy that switches between prefilling and decoding, ARCC kernels for efficient attention over a heterogeneous KV cache, and a KV-cache management protocol with lazy noise refreshing and ciphertext packing. Experimental results on GPT-2-sized models show near-linear scaling with sequence length and 4.4x–7.6x lower per-token latency than prior discriminative secure inference baselines, while keeping perplexity close to that of plaintext generation.

Abstract

The widespread deployment of cloud-hosted generative models raises a fundamental challenge: enabling efficient autoregressive generation while preserving the privacy of both user prompts and model parameters in untrusted environments. We address this challenge in a client-server setting where an untrusted server hosts an autoregressive Transformer and the client requires cryptographic protection for both inputs and inference. We present CryptoGen, the first system to enable scalable privacy-preserving neural generation with persistent encrypted key-value (KV) cache reuse. Discriminative-task secure inference systems incur quadratic latency and memory growth when adapted to autoregressive decoding due to the lack of native encrypted KV-cache support. In contrast, CryptoGen achieves near-linear scaling by securely reusing and updating encrypted KV caches throughout generation. CryptoGen integrates homomorphic encryption and secret sharing to support both prefilling and generation. Key techniques include a unified encrypted KV-cache framework, heterogeneous SIMD encodings for different phases, optimized cipher-cipher matrix-matrix and matrix-vector operations, and efficient noise refresh and ciphertext concatenation mechanisms. Evaluation on generative Transformer models trained on WikiText-2, PTB, and LAMBADA shows that for input lengths of 128-512 tokens, CryptoGen achieves 4.4x-7.6x lower per-token latency than state-of-the-art discriminative secure inference systems, while maintaining near-linear latency and memory scaling, with advantages increasing for longer sequences. CryptoGen is released as an open-source library.
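The scaling argument in the abstract can be illustrated in plaintext. The sketch below is a minimal single-head attention decoder in pure Python with no encryption at all. The function names (`project`, `attend`, `decode_without_cache`, `decode_with_cache`) and the use of key/value projection counts as a cost proxy are assumptions made for illustration; they are not CryptoGen's API or its cost model. The sketch only shows why recomputing keys and values at every step grows quadratically with sequence length, while appending to a cache grows linearly.

```python
import math


def project(x, w):
    """Vector-matrix product: x has length d, w is a d x d list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]


def decode_without_cache(tokens, wq, wk, wv):
    """Recompute K and V for the whole prefix at every step (no cache)."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        projections += 2 * t  # t key projections + t value projections
        outputs.append(attend(project(prefix[-1], wq), keys, values))
    return outputs, projections


def decode_with_cache(tokens, wq, wk, wv):
    """Project only the new token and append it to the KV cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(project(x, wk))
        v_cache.append(project(x, wv))
        projections += 2  # one key projection + one value projection
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs, projections
```

Both decoders return the same attention outputs. For a sequence of 8 tokens, the version without a cache performs 72 key/value projections in total (2 · (1 + 2 + ... + 8)), while the cached version performs 16. In the encrypted setting each projection is far more expensive than in plaintext, so the same gap is what the abstract describes as quadratic versus near-linear growth.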

Paper Structure (31 sections, 5 equations, 9 figures, 4 tables, 1 algorithm)

Introduction
Background
Auto-regressive Transformer Generation
KV Cache in Generation
Cryptographic Primitives
Data Encodings in Encrypted Computing
Motivation
Prior Work: Secure Inference for Discriminative Transformers
Why Prior Works Do Not Extend to Auto-regressive Generation
Motivation: Secure Generation with Encrypted KV Cache Reuse
CryptoGen
CryptoGen Framework
Heterogeneous Encoding based CT-PT Multiplication
Complexity Comparison of Heterogeneous Encoding-based CPMM and Prior Work
Auto-Regressive Ciphertext-Ciphertext Multiplication (ARCC)
...and 16 more sections

Figures (9)

Figure 1: KV cache in auto-regressive Transformer generation.
Figure 2: Three ciphertext packing strategies in encrypted computing: (a) outer-based, (b) inner-based, and (c) diagonal encoding. Different packings expose different forms of SIMD parallelism and lead to different trade-offs between the number of ciphertexts, cross-ciphertext reductions, and rotation overhead.
Figure 3: Why outer-based encodings (including BOLT's outer-diagonal design) do not fit token-by-token decoding with an encrypted KV cache: after prefilling, each step introduces only one new token, leading to sparse ciphertext utilization or redundant recomputation.
Figure 4: Scalability comparison of secure auto-regressive generation. The latency of the discriminative inference framework (BOLT) scales poorly because it lacks KV cache reuse. CryptoGen maintains near-linear scaling with sequence length by reusing the KV cache. Refer to the end-to-end performance section for detailed analysis.
Figure 5: CryptoGen inference workflow. It combines (i) prefilling with outer-diagonal CT$\times$PT matrix multiplication (CPMM) and (ii) autoregressive decoding with inner-diagonal CT$\times$PT vector-matrix multiplication (CPVM), bridged by a heterogeneous encrypted KV cache and MPC-based nonlinear primitives.
...and 4 more figures
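The diagonal encoding named in the Figure 2 caption can be illustrated with a plaintext simulation. The sketch below uses the widely known diagonal method for matrix-vector products over SIMD slots: each generalized diagonal of the matrix is packed into one slot vector, and the product is accumulated from slot-wise multiplications with rotated copies of the input. Python lists stand in for ciphertext slots, and the helper names (`rotate_left`, `diagonals`, `diagonal_matvec`) are hypothetical. This is not CryptoGen's kernel, and it does not cover the outer-based or inner-based packings.

```python
def rotate_left(slots, k):
    """Cyclic left rotation of a slot vector, as an HE rotation would do."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def diagonals(matrix):
    """Pack a square matrix into its n generalized diagonals.

    Diagonal i holds matrix[j][(j + i) % n] in slot j.
    """
    n = len(matrix)
    return [[matrix[j][(j + i) % n] for j in range(n)] for i in range(n)]


def diagonal_matvec(matrix, vector):
    """Compute matrix @ vector using only slot-wise ops and rotations."""
    n = len(vector)
    acc = [0] * n
    for i, diag in enumerate(diagonals(matrix)):
        rotated = rotate_left(vector, i)  # slot j now holds vector[(j + i) % n]
        acc = [a + d * r for a, d, r in zip(acc, diag, rotated)]
    return acc


if __name__ == "__main__":
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    v = [1, 0, 2]
    print(diagonal_matvec(m, v))  # [7, 16, 25]
```

For an n x n matrix this uses n slot-wise multiplications and n - 1 non-trivial rotations, with no reduction across ciphertexts. That is one instance of the trade-off between ciphertext count, cross-ciphertext reductions, and rotation overhead described in the caption.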
