Encryption-Friendly LLM Architecture

Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Ernest K. Ryu, Jung Hee Cheon

TL;DR

This paper tackles privacy concerns in personalized LLM interactions by enabling private fine-tuning and inference using homomorphic encryption. It introduces an HE-friendly transformer that combines LoRA-based fine-tuning and Gaussian kernel attention to bypass hard non-polynomial operations, achieving substantial speedups over prior encrypted transformers. Empirical results on a BERT-style encoder demonstrate 6.94x faster fine-tuning and 2.3x faster inference with negligible accuracy loss on downstream tasks, highlighting the practicality of privacy-preserving LLM services. The work lays a foundation for scalable encrypted NLP pipelines and points to future work on encryption-aware training and efficient HE primitives.

Abstract

Large language models (LLMs) offer personalized responses based on user interactions, but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a cryptographic protocol supporting arithmetic computations in encrypted states and provides a potential solution for privacy-preserving machine learning (PPML). However, the computational intensity of transformers poses challenges for applying HE to LLMs. In this work, we propose a modified HE-friendly transformer architecture with an emphasis on inference following personalized (private) fine-tuning. Utilizing LoRA fine-tuning and Gaussian kernels, we achieve significant computational speedups -- 6.94x for fine-tuning and 2.3x for inference -- while maintaining performance comparable to plaintext models. Our findings provide a viable proof of concept for offering privacy-preserving LLM services in areas where data protection is crucial. Our code is available on GitHub.
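
To make the softmax replacement concrete, the following is a minimal plaintext sketch (no encryption involved) contrasting standard softmax attention scores with Gaussian-kernel scores. The function names, the toy inputs, and the scaling factor $1/(2\sqrt{d})$ are assumptions made for illustration; the abstract itself only states that Gaussian kernels are used. The point of the sketch is that the kernel needs only subtraction, squaring, summation, and one exponential of a non-positive argument, whereas softmax additionally needs a row-wise maximum and a division.

```python
import math

def softmax_attention_scores(Q, K):
    """Scaled dot-product scores with softmax (reference, plaintext only)."""
    d = len(Q[0])
    scores = []
    for q in Q:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(logits)                       # row-wise max: a comparison
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        scores.append([e / total for e in exps])  # normalization: a division
    return scores

def gaussian_kernel_scores(Q, K):
    """Gaussian-kernel scores exp(-scale * ||q - k||^2).

    The scale 1 / (2 * sqrt(d)) is an assumption for this sketch.
    No max and no division are needed, and the exponent is never positive.
    """
    d = len(Q[0])
    scale = 1.0 / (2.0 * math.sqrt(d))
    return [
        [math.exp(-scale * sum((qi - ki) ** 2 for qi, ki in zip(q, k))) for k in K]
        for q in Q
    ]

# Toy queries and keys (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
sm = softmax_attention_scores(Q, K)
gk = gaussian_kernel_scores(Q, K)
```

In this sketch every Gaussian-kernel score lies in $(0, 1]$, so an encrypted implementation would only need to approximate the exponential on non-positive inputs, while the softmax rows are normalized to sum to one at the cost of the max and division steps.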

Paper Structure

This paper contains 49 sections, 12 equations, 9 figures, 16 tables, 5 algorithms.

Figures (9)

  • Figure 1: Proposed privacy-preserving LLM under homomorphic encryption (HE). HE cryptographically protects the user’s fine-tuning and inference data. We resolve two computational bottlenecks. First, we reduce the size of ciphertext-ciphertext matrix multiplication (CCMM) using LoRA fine-tuning. Second, we avoid the softmax computation, which is notoriously challenging to compute under HE, and replace it with a much simpler Gaussian kernel (GK).
  • Figure 2: Full fine-tuning.
  • Figure 3: LoRA fine-tuning.
  • Figure 4: Row-wise packing method for matrix representation, utilizing zero-padding for non-square matrices, followed by block-wise matrix multiplication for efficient processing of large matrices.
  • Figure 5: LoRA-friendly packing is used when the given matrix has one long and one short dimension. Split & repeat divides and copies each row into ciphertexts row-wise and is used during LoRA CCMMs; $a_{i}$ denotes the $i$-th row of matrix $a$. The shaded block matrices represent zero-padded blocks.
  • ...and 4 more figures
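
The contrast between full fine-tuning (Figure 2) and LoRA fine-tuning (Figure 3) can be sketched in plaintext as follows. This is an illustrative sketch, not the paper's implementation: the helper names, the row-vector convention $Y = XW + (XA)B$, and the example sizes (sequence length 128, hidden size 768, rank 2) are all assumptions. It also assumes that the frozen pretrained weight $W$ stays in plaintext while only the low-rank factors are encrypted, so that only the two small products would be ciphertext-ciphertext. The scalar multiplication counts are a rough proxy for CCMM size, not a measurement of HE cost.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def lora_forward(X, W, A_down, B_up):
    """Y = X W + (X A_down) B_up, with W frozen and only the factors trained.

    Shapes: X is n x d_in, W is d_in x d_out,
    A_down is d_in x r, B_up is r x d_out.
    """
    return add(matmul(X, W), matmul(matmul(X, A_down), B_up))

def cc_mults_full(n, d_in, d_out):
    """Scalar multiplications if the whole weight W were encrypted."""
    return n * d_in * d_out

def cc_mults_lora(n, d_in, d_out, r):
    """Scalar multiplications in the two low-rank products only."""
    return n * d_in * r + n * r * d_out

# Hypothetical sizes, chosen only to show the scaling.
full = cc_mults_full(128, 768, 768)
lora = cc_mults_lora(128, 768, 768, 2)
ratio = full // lora
```

Under these assumptions the ratio equals $d / (2r)$ for a square weight, which is why a small rank shrinks the encrypted matrix products so sharply.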