Confidential Prompting: Privacy-preserving LLM Inference on Cloud

Caihua Li, In Gim, Lin Zhong

TL;DR

This work addresses the privacy risks of cloud-hosted LLM inference by protecting user prompts from untrusted cloud and LLM providers. It introduces Petridish, a system that runs the LLM inside a confidential VM and employs Secure Partitioned Decoding (SPD) to separate per-user input processing from batched decoding, preserving model confidentiality and output fidelity while enabling auditable protection. The authors formalize a lossless attention partitioning approach, implement a prototype on NVIDIA GPU-accelerated confidential computing (CC) hardware, and show that SPD delivers scalable, efficient performance with reduced latency compared to per-user isolated deployments. Overall, Petridish demonstrates a practical path toward privacy-preserving, auditable, cloud-based LLM services suitable for handling sensitive data such as clinical or financial records, without sacrificing utility.

Abstract

This paper introduces a vision of confidential prompting: securing user prompts from an untrusted, cloud-hosted large language model (LLM) while preserving model confidentiality, output invariance, and compute efficiency. As a first step toward this vision, we present Petridish, a system built on top of confidential computing and its core contribution, a novel technology called Secure Partitioned Decoding (SPD). Petridish runs the LLM service inside a confidential virtual machine (CVM), which protects the secrets, i.e., the LLM parameters and user prompts, from adversaries outside the CVM. Importantly, it splits the LLM service for a user into two processes, using SPD: a per-user process performs prefill with the user prompts and computes attention scores during decoding; a service process, shared by all users, batches the attention scores from per-user processes and generates output tokens for all users. Both the LLM provider and the users trust Petridish's CVM and its operating system, which guarantees isolation between processes and limits their outbound network capabilities to control information flow. The CVM's attestation capability and its open-source software stack enable Petridish to provide auditable protection of both user prompt and LLM confidentiality. Together, Petridish maintains full utility of LLM service and enables practical, privacy-preserving cloud-hosted LLM inference for sensitive applications, such as processing personal data, clinical records, and financial documents.

Paper Structure

This paper contains 43 sections, 1 theorem, 7 equations, 7 figures.

Key Result

Theorem 1

Let $Q\in\mathbb{R}^{d}$, $K = \text{concat}(K_{in}, K_{out})\in\mathbb{R}^{len\times d}$, $V = \text{concat}(V_{in}, V_{out})\in\mathbb{R}^{len\times d}$, where $len$ is the number of input and output tokens, and let $\sigma$ be the softmax function. where $\gamma_{in},\gamma_{out}$ are the denominators of each softmax operation, e.g. $\gamma_{in}=\sum\ldots$
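
The theorem text above is cut off, so the following is only a sketch of the standard identity that a lossless attention partition of this kind relies on, not the paper's exact statement. It assumes $\gamma_{in}$ and $\gamma_{out}$ are the sums of exponentiated scores over the input and output tokens respectively; the index sets "in" and "out" and the per-token score $s_i$ are notation introduced here for illustration, and any scaling factor such as $1/\sqrt{d}$ is assumed to be folded into $Q$.

```latex
\begin{aligned}
s_i &= Q K_i^{\top}, \qquad
\gamma_{in} = \sum_{i \in \text{in}} e^{s_i}, \qquad
\gamma_{out} = \sum_{i \in \text{out}} e^{s_i}, \\
\sigma(QK^{\top})\,V
 &= \frac{\sum_{i \in \text{in}} e^{s_i} V_i + \sum_{i \in \text{out}} e^{s_i} V_i}
         {\gamma_{in} + \gamma_{out}} \\
 &= \frac{\gamma_{in}\,\sigma(QK_{in}^{\top})\,V_{in}
        + \gamma_{out}\,\sigma(QK_{out}^{\top})\,V_{out}}
         {\gamma_{in} + \gamma_{out}}.
\end{aligned}
```

The last step uses $\sigma(QK_{in}^{\top})V_{in} = \frac{1}{\gamma_{in}}\sum_{i\in\text{in}} e^{s_i}V_i$, and likewise for the output part. Under these assumptions, each partition can be evaluated separately and merged exactly, which is consistent with the output invariance the abstract claims.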

Figures (7)

  • Figure 1: Petridish Overview. Both users and the LLM provider audit the open-source software stack (colored in grey) and verify the execution environment (e.g., challenge performed by User A) before transmitting any secrets via secure encrypted channels. The Process Controller initializes a dedicated process for each user and the LLM provider, which executes within the CVM and on top of the trusted OS. The CVM prevents illegal access from outside the CVM, while the trusted OS guarantees isolation between processes. The per-user processes separately prepare their own input KV cache during prefill, and interact with the service process to generate output tokens using SPD. After decoding, the Process Controller relays output tokens from the service process to the corresponding users.
  • Figure 2: Various confidential inference approaches. (a) The LLM provider deploys an LLM service in its CVM to serve multiple users, which defends against adversaries outside the CVM, but the LLM provider still gets user prompts in plaintext. (b) Each user deploys a dedicated LLM service in its own CVM, which secures user prompts but not LLM parameters, and is inefficient because it lacks batch parallelism and has a large memory footprint. (c) In an auditable, trustworthy CVM, each per-user process runs a dedicated LLM service. This approach secures both user prompts and LLM parameters, but is still inefficient because it lacks batch parallelism and has a large memory footprint. (d) SPD strikes a balance between security and efficiency by isolating user prompts within per-user processes, while allowing the single LLM service to batch decode for all users.
  • Figure 3: Overview of SPD on a simplified Transformer layer. The blue and red squares represent the KV cache associated with different users, while the gray squares represent new tokens. Shaded squares denote the output KV cache and unshaded squares the input KV cache. ⓪ By the end of prefill, the user process finishes computing its input KV cache $K_{in}, V_{in}$, generates the first token, and sends it to the service process. ① Project the hidden state $X_\text{new}$ of a new token to $Q_\text{new}, K_\text{new}, V_\text{new}$. ② Append $K_\text{new}, V_\text{new}$ to the output KV cache. ③ Batch process the output attention score for all users. ④ Compute the input attention score in each user process. ⑤ Merge the results to compute the full attention score. ⑥ If it is the last layer, generate the next token, then repeat from ① until generation finishes; otherwise continue to the next layer.
  • Figure 4: Normalized latency with varying number of users, Llama 3 (8B), 64 input and 64 output tokens. $y=1$ indicates the latency of No Protection baseline.
  • Figure 5: Average latency with varying model sizes, 8 users, 64 input and 64 output tokens. The solid and dashed bars indicate with and without MPS respectively.
  • ...and 2 more figures
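
Steps ② to ⑤ of Figure 3 can be illustrated with a small, single-layer, single-head sketch in plain Python. This is not the authors' implementation: the class names `UserProcess` and `ServiceProcess`, the helper functions, and the choice to pass each partial result together with its softmax denominator are all assumptions made for illustration, and the "processes" here are ordinary objects rather than isolated OS processes. The sketch only checks that attention computed in two partitions and then merged matches attention over the full KV cache.

```python
import math
import random


def partial_attention(q, keys, values):
    """Softmax attention over one KV partition; also returns the softmax denominator."""
    weights = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
    gamma = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / gamma for j in range(dim)]
    return out, gamma


def merge(out_a, gamma_a, out_b, gamma_b):
    """Combine two partial results, weighting each by its softmax denominator."""
    total = gamma_a + gamma_b
    return [(gamma_a * x + gamma_b * y) / total for x, y in zip(out_a, out_b)]


class UserProcess:
    """Hypothetical per-user process: the input KV cache stays inside this object."""

    def __init__(self, k_in, v_in):
        self._k_in, self._v_in = k_in, v_in

    def input_attention(self, q_new):
        # Step 4: attention over the private input KV cache only.
        return partial_attention(q_new, self._k_in, self._v_in)


class ServiceProcess:
    """Hypothetical shared process: holds only the output KV cache of each user."""

    def __init__(self):
        self.k_out, self.v_out = {}, {}

    def decode_step(self, users, new_qkv):
        merged = {}
        for uid, (q, k, v) in new_qkv.items():
            # Step 2: append the new token's K and V to the output KV cache.
            self.k_out.setdefault(uid, []).append(k)
            self.v_out.setdefault(uid, []).append(v)
            # Step 3: output attention (a real system would batch this across users).
            out_o, g_o = partial_attention(q, self.k_out[uid], self.v_out[uid])
            # Step 4: ask the user process for its input attention.
            out_i, g_i = users[uid].input_attention(q)
            # Step 5: merge the two partial results into the full attention output.
            merged[uid] = merge(out_i, g_i, out_o, g_o)
        return merged


def rand_vecs(rng, n, dim):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]


def demo(seed=0, dim=4, n_in=5, steps=3):
    """Return the largest gap between merged and unpartitioned attention."""
    rng = random.Random(seed)
    caches = {uid: (rand_vecs(rng, n_in, dim), rand_vecs(rng, n_in, dim))
              for uid in ("A", "B")}
    users = {uid: UserProcess(k, v) for uid, (k, v) in caches.items()}
    service = ServiceProcess()
    worst = 0.0
    for _ in range(steps):
        # Random stand-ins for the projected Q_new, K_new, V_new of each user.
        new_qkv = {uid: tuple(rand_vecs(rng, 3, dim)) for uid in users}
        merged = service.decode_step(users, new_qkv)
        for uid, (q, _, _) in new_qkv.items():
            k_all = caches[uid][0] + service.k_out[uid]
            v_all = caches[uid][1] + service.v_out[uid]
            full, _ = partial_attention(q, k_all, v_all)
            worst = max(worst, max(abs(a - b) for a, b in zip(full, merged[uid])))
    return worst
```

Calling `demo()` returns the largest absolute difference between the merged result and the unpartitioned reference, which should be at floating-point rounding level. In this sketch the service object never reads `_k_in` or `_v_in`; in the described system that separation is enforced by the trusted OS inside the CVM rather than by Python conventions.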

Theorems & Definitions (1)

  • Theorem 1: Secure Partitioned Attention Computation