Practical Secure Inference Algorithm for Fine-tuned Large Language Model Based on Fully Homomorphic Encryption

Zhang Ruoyan; Zheng Zhongxiang; Bao Wankang

TL;DR

The paper addresses privacy in large language model inference by combining Fully Homomorphic Encryption (FHE) with Parameter-Efficient Fine-Tuning (PEFT, here LoRA) and a new Private Linear Layer (PLL). It splits the model into a public Open-LLM component and a private LoRA component, so that ciphertext computation is needed only for the private weights while the base model stays in plaintext on the client. A security reduction relates model extraction attacks on PLL to the Learning with Errors (LWE) problem. Empirically, the scheme reaches 1.61s per token for 1000 tokens on ChatGLM2-6B with LoRA rank 8, which the authors report as outperforming prior private-inference systems and as practical for vertical-domain use cases. The main contributions are a reduced ciphertext workload and provable-security-based protection for the private model components.

Abstract

Large language models (LLMs) are currently at the forefront of machine learning. They show broad application prospects but also expose risks of privacy leakage. We combine Fully Homomorphic Encryption (FHE) and provable security theory with Parameter-Efficient Fine-Tuning (PEFT) to propose an efficient and secure inference scheme for LLMs. More specifically, we focus on LLMs that are built on an open-source base model and then fine-tuned on private datasets with LoRA. This is a popular roadmap for vertical-domain models such as LawGPT and BenTsao. We use two key techniques. First, we divide the whole model into a public part and a private part. The weights of the public part are publicly accessible (e.g., the open-source base model), while the private part needs to be protected (e.g., the LoRA matrices). In this way, the overhead of computing on private data is greatly reduced. Second, we propose a general method to transform a linear layer into another one that provides security against model extraction attacks while preserving its original functionality; we denote the result a Private Linear Layer (PLL). We apply this method to the LoRA matrices so that the server protects its private weights without restricting the user's input. We also show that the difficulty of performing model extraction attacks on PLL can be reduced to the well-known hard problem Learning with Errors (LWE). Combining this method with FHE, we can protect the user's input at the same time. In this paper, we use the open-source model ChatGLM2-6B, fine-tuned with LoRA, as the base model. Experimental results show that the inference efficiency of our scheme reaches 1.61s/token, which indicates that the scheme is practical.
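The public/private split described in the abstract rests on the usual LoRA identity $(W + BA)x = Wx + B(Ax)$: the base-weight term and the low-rank term can be computed separately and then added. The following is a minimal plaintext sketch of that decomposition only. It performs no encryption and does not implement PLL; the function names, the toy sizes, and the hidden size 4096 used for the parameter count are illustrative assumptions, not taken from the paper.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def random_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]; stands in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def merged_forward(W, A, B, x):
    """Reference output y = (W + B A) x using explicitly merged weights."""
    r = len(A)
    merged = [
        [W[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]
    return matvec(merged, x)

def public_part(W, x):
    """Public branch: touches only the open base weights W."""
    return matvec(W, x)

def private_part(A, B, x):
    """Private branch: touches only the low-rank LoRA matrices A and B."""
    return matvec(B, matvec(A, x))

def split_forward(W, A, B, x):
    """Sum of the two branches; equals merged_forward up to rounding."""
    return [p + q for p, q in zip(public_part(W, x), private_part(A, B, x))]

def lora_param_count(d, r):
    """Number of entries in A (r x d) plus B (d x r)."""
    return 2 * d * r

rng = random.Random(0)
d, r = 4, 2  # toy sizes, matching the d = 4, r = 2 example of Figure 2
W = random_matrix(d, d, rng)
A = random_matrix(r, d, rng)
B = random_matrix(d, r, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]

max_diff = max(abs(a - b) for a, b in
               zip(merged_forward(W, A, B, x), split_forward(W, A, B, x)))

# Illustrative size comparison for one square layer (hidden size is an assumption).
hidden, rank = 4096, 8
ratio = (hidden * hidden) // lora_param_count(hidden, rank)
```

Under these assumed sizes, the private branch holds 256 times fewer weights than the square base layer, which is the intuition behind protecting only the LoRA matrices rather than the whole model.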

Paper Structure (17 sections, 1 theorem, 23 equations, 8 figures, 2 tables, 1 algorithm)


Introduction
Background
Notation
Transformer Model
LoRA (Low-Rank Adaptation)
Homomorphic Encryption
Learning with Errors
Practical Secure Inference Algorithm for Fine-tuned Large Language Model Based on FHE
Open-LLM + Private-LoRA
Private Linear Layer
Implementation of PLL based on FHE
Experiments
Experiment Settings
Improvement of Computing and Communication Efficiency
Application Scenario Setting
...and 2 more sections

Key Result

Theorem 1

For parameters $\gamma/q \ge 2\sqrt{m}$, if there exists an adversary that can solve the $SolveMatrix(t,m,n,q,\gamma,\beta)$ problem, then it can distinguish the LWE problem, i.e., solve $CLWE(m,n,\gamma/q,\beta/q)$.
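For orientation, the standard textbook form of the LWE problem can be written as below. This is the common formulation, not a transcription of the paper's Definitions 2.1 to 2.3, whose parameterization (in particular for the CLWE variant used in the theorem) may differ; the symbols $\mathbf{s}$, $\mathbf{a}_i$, $e_i$, and $\chi$ are notation assumed here.

```latex
% Standard LWE samples (assumed notation; not copied from the paper)
\[
  \mathbf{s} \in \mathbb{Z}_q^{n}, \qquad
  \mathbf{a}_i \xleftarrow{\$} \mathbb{Z}_q^{n}, \qquad
  e_i \leftarrow \chi, \qquad
  b_i = \langle \mathbf{a}_i, \mathbf{s} \rangle + e_i \bmod q,
  \qquad i = 1, \dots, m .
\]
% Search LWE: given (a_i, b_i) for i = 1..m, recover s.
% Decisional LWE: distinguish (a_i, b_i) for i = 1..m
% from m uniform samples over Z_q^n x Z_q.
```

Read in this light, Theorem 1 states that an adversary solving $SolveMatrix$ yields a solver for the corresponding $CLWE$ instance, so under the stated parameter condition $SolveMatrix$ is at least as hard as that instance.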

Figures (8)

Figure 1: Transformer Architecture
Figure 2: LoRA applied to transformer with $d = 4, r = 2$
Figure 3: Data transmission between client and server in the self-attention of Transformer
Figure 4: "Open-LLM + Private-LoRA" Structure
Figure 5: Security Reduction Process
...and 3 more figures

Theorems & Definitions (6)

Definition 2.1: Search-version LWE (LWE)
Definition 2.2: Decisional LWE (DLWE)
Definition 2.3: CLWE
Definition 3.1: $SolveMatrix(t,m,n, q, \gamma,\beta)$
Theorem 1
proof
