Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models

Chung-ju Huang, Huiqiang Zhao, Yuanpeng He, Lijian Li, Wenpin Jiao, Zhi Jin, Peixuan Chen, Leye Wang

TL;DR

To the best of our knowledge, this is the first work that ensures clients' prompts and responses remain inaccessible to the cloud, while also preserving model privacy, performance, and efficiency.

Abstract

The increasing reliance on cloud-hosted Large Language Models (LLMs) exposes sensitive client data, such as prompts and responses, to potential privacy breaches by service providers. Existing approaches fail to ensure privacy, maintain model performance, and preserve computational efficiency simultaneously. To address this challenge, we propose Talaria, a confidential inference framework that partitions the LLM pipeline to protect client data without compromising the cloud's model intellectual property or inference quality. Talaria executes sensitive, weight-independent operations within a client-controlled Confidential Virtual Machine (CVM) while offloading weight-dependent computations to the cloud GPUs. The interaction between these environments is secured by our Reversible Masked Outsourcing (ReMO) protocol, which uses a hybrid masking technique to reversibly obscure intermediate data before outsourcing computations. Extensive evaluations show that Talaria can defend against state-of-the-art token inference attacks, reducing token reconstruction accuracy from over 97.5% to an average of 1.34%, all while being a lossless mechanism that guarantees output identical to the original model without significantly decreasing efficiency and scalability. To the best of our knowledge, this is the first work that ensures clients' prompts and responses remain inaccessible to the cloud, while also preserving model privacy, performance, and efficiency.

Paper Structure

This paper contains 43 sections, 1 theorem, 27 equations, 5 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Let $e=\mathrm{vec}(E)\in\mathbb{R}^{N}$ with $N=nd$. Suppose the mask has i.i.d. coordinates $m_i\sim\mathrm{Unif}[-\lambda/2,\lambda/2]$ and $\hat{e}_b=e_b+m$ for $b\in\{1,2\}$. Then the optimal adversary (with unbounded computation) that observes $\hat{e}$ satisfies

Figures (5)

  • Figure 1: Privacy leakage in cloud-based LLMs.
  • Figure 2: The left panel shows existing CVM-based methods, which isolate the prompt in the CVM but expose the response. The right panel shows our method, which isolates both the prompt and the response in the CVM.
  • Figure 3: Using a Qwen [qwen3technicalreport] attention layer as an example, we split inference into two parts. The green box (weighted decoding) performs linear projections that apply model weights $W$ to inputs. The yellow box (structural decoding) comprises weight-free ops: RMSNorm [ZhangS19a], attention score/softmax [VaswaniSPUJGKP17], and residual connections. We run green-box ops on public GPUs and yellow-box ops on cloud CVMs to minimize exposure of confidential weights.
  • Figure 4: The overview of Talaria. The client first transmits the prompt to the CVM through a secure channel. For each new token generated, we use CPI for secure collaborative inference. When all tokens are generated, the CVM sends the response back to the client confidentially. All private data is isolated in the CVM and is invisible to the cloud.
  • Figure 5: Efficiency Evaluation.
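
The split described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, omits the learned RMSNorm gain, positional encoding, and any masking protocol, and models the GPU and CVM sides as plain functions. The names `weighted_step`, `structural_attention`, and `split_attention_layer` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Weight-free core of RMSNorm: divide each row by its root-mean-square.
    # The learned gain of a real RMSNorm is omitted in this sketch.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_step(x, W):
    # "Green box": a linear projection, the only kind of op that reads weights.
    return x @ W

def structural_attention(q, k, v):
    # "Yellow box": causal attention scores and softmax, no weights involved.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def split_attention_layer(x, Wq, Wk, Wv, Wo):
    # Alternate between weight-free (structural) and weight-dependent steps.
    h = rms_norm(x)                      # structural
    q = weighted_step(h, Wq)             # weighted
    k = weighted_step(h, Wk)             # weighted
    v = weighted_step(h, Wv)             # weighted
    a = structural_attention(q, k, v)    # structural
    o = weighted_step(a, Wo)             # weighted
    return x + o                         # structural (residual connection)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))          # 4 tokens, hidden size 8
    Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
    print(split_attention_layer(x, Wq, Wk, Wv, Wo).shape)
```

In this sketch only `weighted_step` ever receives a weight matrix, which is the property the caption relies on: the remaining operations need no access to $W$ and can therefore run in a separate environment.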

Theorems & Definitions (2)

  • Definition 1: Computational indistinguishability
  • Theorem 1: Information-theoretic bound
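
The setup of Theorem 1 (an additive mask with i.i.d. $\mathrm{Unif}[-\lambda/2,\lambda/2]$ coordinates and two candidate vectors $e_1, e_2$) can be explored numerically with the sketch below. It does not restate the paper's bound. It only computes a standard quantity for this setup: the total variation distance between two uniform distributions on shifted hypercubes, and the resulting best achievable probability of guessing $b$ under a uniform prior. The function names are hypothetical.

```python
import numpy as np

def additive_uniform_mask(shape, lam, rng):
    # i.i.d. coordinates drawn from Unif[-lam/2, lam/2], as in the theorem setup.
    return rng.uniform(-lam / 2.0, lam / 2.0, size=shape)

def tv_shifted_uniform(e1, e2, lam):
    # e_b + m is uniform on a cube of side lam centred at e_b.
    # The two cubes overlap in a box with side (lam - |e1_i - e2_i|) per axis,
    # clipped at zero, so TV = 1 - overlap_volume / lam**N.
    overlap = np.prod(np.clip(1.0 - np.abs(e1 - e2) / lam, 0.0, 1.0))
    return 1.0 - overlap

def optimal_guess_probability(e1, e2, lam):
    # With a uniform prior on b, the Bayes-optimal guess succeeds with
    # probability 1/2 + TV/2.
    return 0.5 + 0.5 * tv_shifted_uniform(e1, e2, lam)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e1 = np.zeros(2)
    e2 = np.array([1.0, 0.0])
    lam = 4.0
    m = additive_uniform_mask(e1.shape, lam, rng)
    masked = e1 + m
    # The mask is reversible by subtraction (up to floating-point rounding).
    print(np.allclose(masked - m, e1))
    print(optimal_guess_probability(e1, e2, lam))
```

Increasing `lam` relative to the coordinate-wise gaps between the candidates shrinks the distance toward 0, while any coordinate gap of at least `lam` makes the two masked distributions disjoint and the candidates perfectly distinguishable.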