Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt

Yujie Gu; Richeng Jin; Xiaoyu Ji; Yier Jin; Wenyuan Xu

TL;DR

DEL is a statistically rigorous framework for privacy-preserving LLM split inference that reduces communication cost while preserving utility on generation and understanding tasks. It combines user-side embedding projection with a stochastic $n$-bit quantization mechanism that achieves $f$-DP via Gaussian differential privacy (GDP) guarantees, and mitigates utility loss through soft prompts that are trained exclusively on the server. Empirical results across generation and NLU benchmarks show that DEL outperforms state-of-the-art privacy-preserving baselines in the privacy-utility trade-off and reduces communication cost by up to a factor of 32. The approach eliminates the need for costly local denoisers, enabling practical private LLM inference in resource-constrained environments with strong privacy guarantees.

Abstract

Large Language Models (LLMs) have achieved remarkable performance and received significant research interest. Their enormous computational demands, however, hinder local deployment on devices with limited resources. The currently prevalent LLM inference paradigms require users to send queries to service providers for processing, which raises critical privacy concerns. Existing approaches let users obfuscate the token embeddings before transmission and rely on local models for denoising. Nonetheless, transmitting the token embeddings and deploying local models may incur excessive communication and computation overhead, preventing practical implementation. In this work, we propose DEL, a framework for Differentially private and communication Efficient LLM split inference. More specifically, an embedding projection module and a differentially private stochastic quantization mechanism are proposed to reduce the communication overhead in a privacy-preserving manner. To eliminate the need for local models, we adapt soft prompts at the server side to compensate for the utility degradation caused by the privacy mechanism. To the best of our knowledge, this is the first work that uses soft prompts to improve the trade-off between privacy and utility in LLM inference. Extensive experiments on text generation and natural language understanding benchmarks demonstrate the effectiveness of the proposed method.

Paper Structure (23 sections, 3 theorems, 39 equations, 7 figures, 9 tables)

Introduction
Related Work
Privacy Protection for LLMs
Soft Prompt
Preliminaries and Problem Setup
$f$-Differential Privacy
Threat Model and Problem Formulation
Methodology
Privacy Protection mechanism
Utility Preserving Mechanism
Experiments
Experiments Setup
Performance on Open-ended Text Generation Tasks
Performance on NLU Tasks
Transferability of Soft Prompt
...and 8 more sections

Key Result

Theorem 4.2

For a given vector $\bm{v}_i = [v_{i,1}, \dots, v_{i,d}] \in [-c, c]^d$, $\hat{\bm{v}}_i=\mathcal{M}^{\text{sto}}(\bm{v}_i; A, n)$ is an unbiased estimator of $\bm{v}_i$ with the variance $\operatorname{Var}(\mathcal{M}^{\text{sto}}(\bm{v}_{i}; A, n)) =\frac{dA^2-\|\bm{v}_i\|_2^2}{2^n-1}$. Moreover, in which
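
The unbiasedness and variance claims in Theorem 4.2 can be checked numerically. Definition 4.1 is not reproduced on this page, so the sketch below is an assumption rather than the paper's mechanism: it uses a binomial-style quantizer that draws an $n$-bit integer code $k \sim \text{Binomial}(2^n-1, \frac{A+v}{2A})$ per coordinate and decodes it to $-A + \frac{2Ak}{2^n-1}$. This construction requires $A \ge c$ and happens to reproduce the stated variance $\frac{dA^2-\|\bm{v}_i\|_2^2}{2^n-1}$. The function names are hypothetical.

```python
import numpy as np

def stochastic_quantize(v, A, n, rng):
    """Draw an n-bit integer code per coordinate (assumed binomial construction).

    Each coordinate v_j in [-A, A] is mapped to k_j ~ Binomial(2^n - 1, (A + v_j) / (2A)),
    so k_j lies in {0, ..., 2^n - 1} and fits in n bits.
    """
    v = np.asarray(v, dtype=float)
    levels = 2 ** n - 1
    p = (A + v) / (2.0 * A)  # success probability, in [0, 1] when |v_j| <= A
    return rng.binomial(levels, p)

def dequantize(k, A, n):
    """Map integer codes back to the grid of 2^n evenly spaced points in [-A, A]."""
    levels = 2 ** n - 1
    return -A + 2.0 * A * np.asarray(k, dtype=float) / levels

def theoretical_variance(v, A, n):
    """Total variance stated in Theorem 4.2: (d * A^2 - ||v||_2^2) / (2^n - 1)."""
    v = np.asarray(v, dtype=float)
    return (v.size * A ** 2 - np.sum(v ** 2)) / (2 ** n - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = np.array([0.3, -0.5, 0.1, 0.0])
    A, n, trials = 1.0, 2, 200_000

    # Quantize many independent copies of v to estimate mean and variance.
    codes = stochastic_quantize(np.tile(v, (trials, 1)), A, n, rng)
    v_hat = dequantize(codes, A, n)

    print("empirical mean      :", v_hat.mean(axis=0))
    print("empirical total var :", v_hat.var(axis=0).sum())
    print("theoretical variance:", theoretical_variance(v, A, n))
```

Under this assumed construction, only the $n$-bit codes would need to be transmitted, and increasing $n$ shrinks the variance by the factor $2^n-1$ in the theorem.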

Figures (7)

Figure 1: Illustration of privacy leakage during the LLM inference.
Figure 2: An overview of the proposed differentially private and utility-preserving LLM inference framework.
Figure 3: Comparison between the Proposed Method and the Gaussian Mechanism on the PTB Dataset using Qwen2.5-7B.
Figure 4: Comparison of Attack Success Rates (ASR) between embedding inversion and input inference attacks across different LLMs, datasets, and DP mechanisms.
Figure 5: An overview of the SnD framework [mai2024split] for NLU tasks.
...and 2 more figures

Theorems & Definitions (8)

Definition 3.1: Trade-off function [dong2022gaussian]
Definition 3.2: $f$-DP [dong2022gaussian]
Definition 4.1: Stochastic $n$-bit Quantization
Theorem 4.2
Remark 4.3
proof
Lemma 1.1
Lemma 1.2 [dong2022gaussian]
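
For orientation, the standard forms of the notions named in Definitions 3.1 and 3.2, as they appear in the Gaussian differential privacy literature, are recalled below. The symbols $P$, $Q$, $\phi$, $S$, $S'$, and $\mu$ are chosen here for illustration, and the paper's own notation may differ. For a rejection rule $\phi$ testing $P$ against $Q$, write $\alpha_\phi = \mathbb{E}_P[\phi]$ for the type I error and $\beta_\phi = 1 - \mathbb{E}_Q[\phi]$ for the type II error.

$$
T(P, Q)(\alpha) = \inf_{\phi} \{\beta_\phi : \alpha_\phi \le \alpha\}, \qquad \alpha \in [0, 1].
$$

A mechanism $\mathcal{M}$ is $f$-DP if, for every pair of neighboring inputs $S$ and $S'$,

$$
T\big(\mathcal{M}(S), \mathcal{M}(S')\big)(\alpha) \ge f(\alpha) \quad \text{for all } \alpha \in [0, 1].
$$

$\mu$-GDP is the special case $f = G_\mu$, where $\Phi$ denotes the standard normal CDF:

$$
G_\mu(\alpha) = T\big(\mathcal{N}(0, 1), \mathcal{N}(\mu, 1)\big)(\alpha) = \Phi\big(\Phi^{-1}(1 - \alpha) - \mu\big).
$$

A smaller $\mu$ makes the two output distributions harder to distinguish, which corresponds to a stronger privacy guarantee.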