An Efficient Private GPT Never Autoregressively Decodes

Zhengyi Li, Yue Guan, Kang Yang, Yu Feng, Ning Liu, Yu Yu, Jingwen Leng, Minyi Guo

TL;DR

This paper tackles privacy-preserving GPT inference by moving part of the decoding work out of the cryptographic protocol and into a public model. It introduces POST, in which a public model drafts tokens and the private model securely verifies them in a single decoding step, aided by knowledge-distillation-based model alignment. The approach yields end-to-end speedups of roughly 2.1x to 6.0x across multiple model pairs and tasks while maintaining the same generation quality as standard secure decoding. The method is compatible with existing cryptographic protocols and benefits further from larger, better-aligned public models.

Abstract

The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens incur similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve in two ways: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.
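The draft-then-verify loop described in the abstract can be illustrated with a minimal plaintext sketch. This is not the paper's protocol: it contains no cryptography, it uses simple greedy token matching in place of the private sampling protocol, and every name here (`draft`, `verify`, `generate`, the two toy models) is hypothetical. It only shows why a high acceptance ratio reduces the number of rounds that involve the private model.

```python
from typing import Callable, List, Tuple

# A toy "model": maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def draft(public_model: Model, prefix: List[int], k: int) -> List[int]:
    """Client side: autoregressively draft k tokens with the public model."""
    out: List[int] = []
    for _ in range(k):
        out.append(public_model(prefix + out))
    return out


def verify(private_model: Model, prefix: List[int], drafted: List[int]) -> List[int]:
    """Plaintext stand-in for secure verification of one batch of drafted tokens.

    A real transformer would score all draft positions in one forward pass;
    the loop below only emulates that. Returns the accepted draft tokens
    followed by one token chosen by the private model.
    """
    accepted: List[int] = []
    for tok in drafted:
        target = private_model(prefix + accepted)
        if target != tok:
            # First mismatch: discard the rest and emit the private model's token.
            return accepted + [target]
        accepted.append(tok)
    # Every drafted token was accepted; the private model adds one more.
    return accepted + [private_model(prefix + accepted)]


def generate(public_model: Model, private_model: Model, prompt: List[int],
             max_new_tokens: int, k: int) -> Tuple[List[int], int]:
    """Run draft/verify rounds; return the tokens and the number of rounds."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        drafted = draft(public_model, tokens, k)
        tokens += verify(private_model, tokens, drafted)
        rounds += 1
    return tokens[: len(prompt) + max_new_tokens], rounds


# Toy models (assumptions for illustration): the private model counts upward
# modulo 10; the public model agrees except right after token 2.
def private_model(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def public_model(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


out, rounds = generate(public_model, private_model, prompt=[0],
                       max_new_tokens=8, k=4)
print(out, rounds)
```

In this toy run the output matches what the private model alone would produce token by token, but it takes 2 verification rounds instead of 8 single-token steps. Under the abstract's observation that verifying several tokens costs about as much as decoding one, fewer rounds is where the saving would come from.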

Paper Structure

This paper contains 48 sections, 6 equations, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: Latencies against different input lengths. The bandwidth and one-way delay are 1000 Mbps and 10 ms.
  • Figure 2: The latency breakdown of some layers. The second row of the x-axis ticks gives the bandwidth and one-way delay. The bars sharing an x-axis tick correspond to input lengths 1, 2, 4, 8, and 16.
  • Figure 3: Overview of the public decoding and secure verification approach. $\langle {\cdot} \rangle$ indicates that the data are encrypted during computation, for example via secret sharing, and are visible only to the data owner.
  • Figure 4: The alignment efficiency of three pairs of models on the Spider task.
  • Figure 5: The end-to-end speedup across two network settings and three pairs of models. The curves illustrate the relationship between speedup and acceptance ratio for various draft lengths. Specific speedups for four selected tasks are marked on these curves.
  • ...and 4 more figures