Encryption-Friendly LLM Architecture

Donghwan Rho; Taeseong Kim; Minje Park; Jung Woo Kim; Hyunsik Chae; Ernest K. Ryu; Jung Hee Cheon

TL;DR

This paper tackles privacy concerns in personalized LLM interactions by enabling private fine-tuning and inference using homomorphic encryption. It introduces an HE-friendly transformer that combines LoRA-based fine-tuning and Gaussian kernel attention to bypass hard non-polynomial operations, achieving substantial speedups over prior encrypted transformers. Empirical results on a BERT-style encoder demonstrate 6.94x faster fine-tuning and 2.3x faster inference with negligible accuracy loss on downstream tasks, highlighting the practicality of privacy-preserving LLM services. The work lays a foundation for scalable encrypted NLP pipelines and points to future work on encryption-aware training and efficient HE primitives.

Abstract

Large language models (LLMs) offer personalized responses based on user interactions, but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a cryptographic protocol supporting arithmetic computations in encrypted states and provides a potential solution for privacy-preserving machine learning (PPML). However, the computational intensity of transformers poses challenges for applying HE to LLMs. In this work, we propose a modified HE-friendly transformer architecture with an emphasis on inference following personalized (private) fine-tuning. Utilizing LoRA fine-tuning and Gaussian kernels, we achieve significant computational speedups -- 6.94x for fine-tuning and 2.3x for inference -- while maintaining performance comparable to plaintext models. Our findings provide a viable proof of concept for offering privacy-preserving LLM services in areas where data protection is crucial. Our code is available on GitHub.
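To illustrate why a Gaussian kernel is easier to evaluate under HE than softmax, the following plaintext sketch computes both kinds of attention weights. It is not the authors' implementation: the function names and the default `scale` of $1/(2\sqrt{d})$ are assumptions made for illustration, and nothing here is encrypted. The point is only that softmax needs a row-wise maximum and a division, whereas the Gaussian kernel needs a squared distance followed by a single exponential.

```python
import math

def softmax_scores(Q, K):
    """Standard attention weights: row-wise softmax of Q K^T / sqrt(d)."""
    d = len(Q[0])
    out = []
    for q in Q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(logits)                 # row maximum: needs comparisons
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        out.append([e / s for e in exps])  # normalization: needs a division
    return out

def gaussian_kernel_scores(Q, K, scale=None):
    """Unnormalized Gaussian-kernel weights exp(-scale * ||q - k||^2)."""
    d = len(Q[0])
    if scale is None:
        # Assumed default scaling for this sketch, not taken from the source.
        scale = 1.0 / (2.0 * math.sqrt(d))
    return [
        [math.exp(-scale * sum((a - b) ** 2 for a, b in zip(q, k))) for k in K]
        for q in Q
    ]
```

Because the exponent is never positive, every Gaussian-kernel weight lies in (0, 1], so the exponential only has to be approximated on a half-line and no max-subtraction or division is required.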

Paper Structure (49 sections, 12 equations, 9 figures, 16 tables, 5 algorithms)

Introduction
Prior work
Transformer-based language models and LoRA.
Privacy-preserving transformer using HE.
Contributions
Server-client computation model and preliminaries
Server-client computation model
Homomorphic encryption and CKKS
Matrix multiplications: PCMM and CCMM.
Polynomial approximations of non-polynomial functions.
Large language models, attention layers, and LoRA fine-tuning
Speedup with LoRA: Avoiding large CCMM
Bottleneck 1: Full fine-tuning incurs large CCMM.
Accelerating homomorphic matrix-multiplication with LoRA.
Reducing optimizer states and inverse square root with LoRA.
...and 34 more sections

Figures (9)

Figure 1: Proposed privacy-preserving LLM under homomorphic encryption (HE). HE cryptographically protects user's fine-tuning and inference data. We resolve two computational bottlenecks. First, we reduce the size of ciphertext-ciphertext matrix multiplication (CCMM) using LoRA fine-tuning. Second, we avoid the softmax computation, which is notoriously challenging to compute under HE, and replace it with a much simpler Gaussian kernel (GK).
Figure 2: Full fine-tuning.
Figure 3: LoRA fine-tuning.
Figure 4: Row-wise packing method for matrix representation, utilizing zero-padding for non-square matrices, followed by block-wise matrix multiplication for efficient processing of large matrices.
Figure 5: LoRA-friendly packing, used when the given matrix has one long and one short dimension. Split & repeat (row-wise) divides each row and copies it into ciphertexts; it is used during LoRA CCMMs. Here $a_{i}$ denotes the $i$-th row of matrix $a$. The shaded block matrices represent zero-padded blocks.
...and 4 more figures
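The zero-padding and block-wise multiplication described in the Figure 4 caption can be sketched in plaintext as follows. This is a minimal illustration under stated assumptions, not the paper's packing code: the helper names (`pad_to_blocks`, `split_blocks`, `block_matmul`) and the block size `b` are hypothetical, and each flattened block merely stands in for the slot vector of one ciphertext.

```python
def pad_to_blocks(M, b):
    """Zero-pad matrix M (a list of rows) so both dimensions are multiples of b."""
    rows, cols = len(M), len(M[0])
    R = -(-rows // b) * b  # ceiling to the next multiple of b
    C = -(-cols // b) * b
    return [[M[i][j] if i < rows and j < cols else 0.0 for j in range(C)]
            for i in range(R)]

def split_blocks(M, b):
    """Split a padded matrix into a grid of b x b blocks, each flattened row-wise."""
    R, C = len(M), len(M[0])
    return [[[M[i + di][j + dj] for di in range(b) for dj in range(b)]
             for j in range(0, C, b)]
            for i in range(0, R, b)]

def block_matmul(A, B, b):
    """Multiply A (m x n) by B (n x p) one block pair at a time."""
    m, p = len(A), len(B[0])
    Ab = split_blocks(pad_to_blocks(A, b), b)
    Bb = split_blocks(pad_to_blocks(B, b), b)
    out = [[0.0] * (len(Bb[0]) * b) for _ in range(len(Ab) * b)]
    for I in range(len(Ab)):
        for J in range(len(Bb[0])):
            for Kb in range(len(Bb)):
                a_blk, b_blk = Ab[I][Kb], Bb[Kb][J]
                for i in range(b):
                    for j in range(b):
                        out[I * b + i][J * b + j] += sum(
                            a_blk[i * b + k] * b_blk[k * b + j] for k in range(b)
                        )
    # Drop the rows and columns that exist only because of zero-padding.
    return [row[:p] for row in out[:m]]
```

For example, `block_matmul([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]], 2)` pads both operands to multiples of 2 and then accumulates block products. The padded zeros contribute nothing to the sums, which is why padding non-square matrices does not change the result.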
