PrivQuant: Communication-Efficient Private Inference with Quantized Network/Protocol Co-Optimization

Tianshi Xu, Shuzhang Zhong, Wenxuan Zeng, Runsheng Wang, Meng Li

TL;DR

This paper proposes PrivQuant, a framework that jointly optimizes the 2PC-based quantized inference protocols and the network quantization algorithm to enable communication-efficient private inference. It also develops a communication-aware mixed-precision quantization algorithm that improves inference efficiency while maintaining high accuracy.

Abstract

Private deep neural network (DNN) inference based on secure two-party computation (2PC) protects the privacy of both the server and the client. However, existing 2PC frameworks suffer from high inference latency due to enormous communication. Because the communication of both linear and non-linear DNN layers decreases with the bit-widths of weights and activations, we propose PrivQuant, a framework that jointly optimizes the 2PC-based quantized inference protocols and the network quantization algorithm, enabling communication-efficient private inference. PrivQuant proposes DNN architecture-aware optimizations of the 2PC protocols for communication-intensive quantized operators and conducts graph-level operator fusion to reduce communication. Moreover, PrivQuant develops a communication-aware mixed-precision quantization algorithm to improve inference efficiency while maintaining high accuracy. The network/protocol co-optimization enables PrivQuant to outperform prior-art 2PC frameworks. With extensive experiments, we demonstrate that PrivQuant reduces communication by $11\times$, $2.5\times$, and $2.8\times$, which results in $8.7\times$, $1.8\times$, and $2.4\times$ latency reduction compared with SiRNN, COINN, and CoPriv, respectively.

Paper Structure

This paper contains 32 sections, 3 theorems, 8 equations, 10 figures, 12 tables, 1 algorithm.

Key Result

Proposition 4.1

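For a given $\langle x \rangle^{(l_1)}$, $\Pi_{\mathrm{Trunc}}^{l_1,l_2}(\langle x \rangle^{(l_1)})$ can be decomposed into $\Pi_{\mathrm{TR}}^{l_1,l_2}$ followed by $\Pi_{\mathrm{Ext}}^{l_1-l_2,l_1}$ as $\Pi_{\mathrm{Trunc}}^{l_1,l_2}(\langle x \rangle^{(l_1)}) = \Pi_{\mathrm{Ext}}^{l_1-l_2,l_1}\big(\Pi_{\mathrm{TR}}^{l_1,l_2}(\langle x \rangle^{(l_1)})\big)$. The decomposition reduces the communication from $\mathrm{O}(\lambda (l_1 + 3))$ to $\mathrm{O}(\lambda (l_1 + 2))$.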
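
The functional identity behind Proposition 4.1 can be checked on plaintext values. The sketch below is not the secret-shared 2PC protocol and says nothing about communication cost. It only verifies that truncate-and-reduce followed by extension gives the same result as truncation. It assumes SiRNN-style semantics: truncation is an arithmetic right shift of an $l_1$-bit value by $l_2$ bits, truncate-and-reduce keeps the top $l_1 - l_2$ bits, and extension is a sign extension. The function names are hypothetical and chosen for illustration.

```python
def to_signed(v, bits):
    """Interpret the low `bits` bits of v as a two's-complement integer."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v


def trunc(x, l1, l2):
    """Assumed Trunc semantics: arithmetic right shift by l2, result kept in l1 bits."""
    return (to_signed(x, l1) >> l2) & ((1 << l1) - 1)


def truncate_and_reduce(x, l1, l2):
    """Assumed TR semantics: drop the low l2 bits, result has l1 - l2 bits."""
    return (x & ((1 << l1) - 1)) >> l2


def extend(y, m, n):
    """Assumed Ext semantics: sign-extend an m-bit value to n bits."""
    return to_signed(y, m) & ((1 << n) - 1)


def decomposition_holds(l1, l2):
    """Exhaustively check Trunc == Ext(TR(.)) over all l1-bit inputs (requires l2 < l1)."""
    return all(
        trunc(x, l1, l2) == extend(truncate_and_reduce(x, l1, l2), l1 - l2, l1)
        for x in range(1 << l1)
    )
```

For example, `decomposition_holds(8, 3)` exhaustively checks every 8-bit input with a 3-bit shift. In the actual protocols the same composition is applied to secret shares, which is where the communication saving stated in the proposition arises.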

Figures (10)

  • Figure 1: Profiling of the ResNet50 building block with representative 2PC protocols, i.e., CrypTFlow2 (first column) and SiRNN (other columns): the scaling and breakdown of (a) total communication and (b) online communication with different bit-widths of weight and activation.
  • Figure 2: Detailed protocols for one convolution with residual connection in (a) CrypTFlow2 and (b) SiRNN. The bit extension, truncation, and re-quantization are required in SiRNN to align the bit-widths and scales of quantized operands.
  • Figure 3: Overview of PrivQuant.
  • Figure 4: An illustration of the OT-based matrix multiplication protocol, which extends $X$ and chooses the client to be the sender. We omit $\langle\cdot \rangle$ for simplicity.
  • Figure 5: (a) The baseline protocol for the residual addition in SiRNN; and (b) our proposed simplified protocol. Here $l\_$ and $s\_$ denote the bit-width and scale of the activations, respectively.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3