PUMA: Secure Inference of LLaMA-7B in Five Minutes

Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen

TL;DR

PUMA addresses privacy concerns in Transformer inference with an end-to-end secure multiparty computation (MPC) framework that keeps accuracy close to plaintext without retraining. It introduces high-fidelity approximations for $\mathrm{GeLU}$ and $\mathrm{softmax}$, plus secure Embedding and LayerNorm procedures, so that pre-trained plaintext models can be loaded directly. Empirical results show speedups of up to about $2\times$ over the prior MPCFormer baseline with comparable accuracy on several models, and secure evaluation of LLaMA-7B takes about five minutes per generated token. The implementation is open-sourced in SecretFlow-SPU, with quantization and hardware accelerators suggested as sources of further gains. Overall, PUMA is a significant step toward scalable, private inference for large Transformer models in deep-learning-as-a-service (DLaaS) settings.

Abstract

With ChatGPT as a representative example, many companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous work has studied secure inference for Transformer models using secure multiparty computation (MPC), where model parameters and clients' prompts are kept secret. Despite this, these frameworks are still limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose the PUMA framework to enable fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions such as GeLU and softmax, which significantly reduce the cost of secure inference while preserving model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about $2\times$ faster than the state-of-the-art framework MPCFORMER (ICLR 2023) and has accuracy similar to plaintext models without fine-tuning (which previous works failed to achieve). PUMA can even evaluate LLaMA-7B in around 5 minutes to generate 1 token. To the best of our knowledge, this is the first time that a model of this parameter size has been evaluated under MPC. PUMA has been open-sourced in the GitHub repository of SecretFlow-SPU.
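To make the MPC setting in the abstract concrete, the following is a minimal sketch of additive secret sharing over a ring with fixed-point encoding. It is an illustration only, not PUMA's actual protocol: the abstract does not specify the sharing scheme, and the ring size, the number of fractional bits, the party count, and all function names here are assumptions chosen for the example. It shows that no single share reveals the value, while additions can be computed locally on shares. Non-linear functions such as GeLU and softmax have no such local shortcut, which is why they are the expensive part and why good approximations matter.

```python
import secrets

RING_BITS = 64              # assumed ring Z_{2^64}
MODULUS = 1 << RING_BITS
FRAC_BITS = 16              # assumed fixed-point precision


def encode(x: float) -> int:
    """Map a real number to a fixed-point ring element."""
    return round(x * (1 << FRAC_BITS)) % MODULUS


def decode(v: int) -> float:
    """Map a ring element back to a real number (two's-complement sign)."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / (1 << FRAC_BITS)


def share(v: int, n_parties: int = 3) -> list[int]:
    """Split v into n uniformly random-looking shares that sum to v mod 2^64."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((v - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine all shares; only possible when every party contributes."""
    return sum(shares) % MODULUS


def add_shared(a: list[int], b: list[int]) -> list[int]:
    """Each party adds its own two shares locally; no communication needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


# Hypothetical values standing in for a client input and a model parameter.
prompt_value = 1.25
weight_value = -0.5

a = share(encode(prompt_value))
b = share(encode(weight_value))
result = decode(reconstruct(add_shared(a, b)))
print(result)  # 0.75
```

Multiplication of two shared values, and any non-polynomial function, requires an interactive protocol between the parties, so the cost of a secure Transformer layer is dominated by how its non-linear functions are expressed in terms of such operations.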

Paper Structure

This paper contains 20 sections, 4 equations, 2 figures, 6 tables, 3 algorithms.

Figures (2)

  • Figure 1: Runtime of GPT2-Base for generating different numbers of output tokens; the input length is $32$.
  • Figure 2: Outputs of LLaMA-7B in plaintext and PUMA.