EQO: Exploring Ultra-Efficient Private Inference with Winograd-Based Protocol and Quantization Co-Optimization

Wenxuan Zeng; Tianshi Xu; Meng Li; Runsheng Wang

TL;DR

EQO tackles the high communication cost of private CNN inference by jointly optimizing OT-based 2PC protocols and neural network quantization around Winograd-based convolution. At the protocol level, it introduces QWinConv, graph-level protocol fusion, a simplified residual protocol, and MSB-known optimizations to minimize online communication. At the network level, it adds Hessian-based mixed-precision quantization and a 2PC-friendly bit re-weighting strategy. Across CIFAR, Tiny-ImageNet, and ImageNet, EQO reduces communication by up to tens of times with accuracy on par with or slightly higher than state-of-the-art baselines. The results indicate that protocol-level design combined with sensitivity-aware quantization can substantially improve the practicality of private inference for CNN workloads.

Abstract

Private convolutional neural network (CNN) inference based on secure two-party computation (2PC) suffers from high communication and latency overhead, especially from convolution layers. In this paper, we propose EQO, a quantized 2PC inference framework that jointly optimizes the CNNs and 2PC protocols. EQO features a novel 2PC protocol that combines Winograd transformation with quantization for efficient convolution computation. However, we observe that naively combining quantization and Winograd convolution is sub-optimal: Winograd transformations introduce extensive local additions and weight outliers that increase the quantization bit widths and require frequent bit width conversions with non-negligible communication overhead. Therefore, at the protocol level, we propose a series of optimizations for the 2PC inference graph to minimize communication. At the network level, we develop a sensitivity-based mixed-precision quantization algorithm to optimize network accuracy given communication constraints. We further propose a 2PC-friendly bit re-weighting algorithm to accommodate weight outliers without increasing bit widths. With extensive experiments, EQO demonstrates 11.7x, 3.6x, and 6.3x communication reduction with 1.29%, 1.16%, and 1.29% higher accuracy compared to state-of-the-art frameworks SiRNN, COINN, and CoPriv, respectively.
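To make the Winograd trade-off in the abstract concrete, the following is a minimal plaintext sketch (no 2PC, no quantization) of the standard F(2x2, 3x3) Winograd algorithm on a single tile. The matrices are the commonly used textbook ones; the function and variable names are illustrative assumptions and are not taken from the paper. The sketch checks that the Winograd result matches direct convolution, counts element-wise multiplications per tile, and estimates how many extra bits the feature transformation can require because of its local additions.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transformation matrices (textbook values).
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=np.float64)


def direct_conv_tile(d, g):
    """Direct 3x3 convolution (CNN-style correlation) of a 4x4 tile -> 2x2 output."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


def winograd_conv_tile(d, g):
    """Winograd F(2x2, 3x3): Y = A^T [(G g G^T) * (B^T d B)] A."""
    U = G @ g @ G.T      # weight transformation (4x4)
    V = BT @ d @ BT.T    # feature transformation (4x4), additions only
    M = U * V            # 16 element-wise multiplications
    return AT @ M @ AT.T # output transformation (2x2), additions only


rng = np.random.default_rng(0)
d = rng.integers(-8, 8, size=(4, 4)).astype(np.float64)  # toy input tile
g = rng.integers(-8, 8, size=(3, 3)).astype(np.float64)  # toy filter

matches = bool(np.allclose(direct_conv_tile(d, g), winograd_conv_tile(d, g)))

# Multiplications per 2x2 output tile.
mults_direct = 2 * 2 * 3 * 3   # 4 outputs, 9 multiplications each
mults_winograd = 4 * 4         # one per element of the transformed tile

# Worst-case growth of the feature transformation: each row of B^T has
# l1-norm at most 2, and the transform is applied on both sides.
row_l1 = np.abs(BT).sum(axis=1).max()
extra_bits = 2 * int(np.log2(row_l1))

print(matches, mults_direct, mults_winograd, extra_bits)
```

The multiplication count drops from 36 to 16 per tile, while the additions in the feature transformation can enlarge values by up to a factor of four in this setting. That growth is the kind of bit-width pressure the abstract describes when quantization is combined naively with Winograd convolution.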

Paper Structure (46 sections, 4 theorems, 12 equations, 16 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Oblivious Transfer (OT)-based Linear Layers
Related Works
Motivations and Overview
Observation 1: the total communication of OT-based 2PC is determined by both the bit widths and the number of multiplications in linear layers.
Observation 2: a naive combination of Winograd transformations and quantization provides limited communication reduction.
Observation 3: Winograd transformations introduce quantization outliers and make low-precision weight quantization challenging.
Overview of EQO
EQO Protocol Optimization
Quantized Winograd Convolution Protocol
Graph-level Protocol Optimization
Graph-level Protocol Fusion
Simplified Residual Protocol
MSB-known Optimization
...and 31 more sections

Key Result

Lemma

Consider $Y = XB$. For each element $Y_{i, j}$, its magnitude can always be bounded by where $||\cdot||_1$ is the $\ell_1$-norm.
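The bound itself does not appear in the statement above. As a hedged sketch, and not necessarily the paper's exact form, one standard inequality consistent with the description follows from the triangle inequality. Here $X_{i,:}$ (row $i$ of $X$) and $B_{:,j}$ (column $j$ of $B$) are notation assumed for this sketch:

```latex
|Y_{i,j}|
  = \Bigl|\sum_{k} X_{i,k}\, B_{k,j}\Bigr|
  \le \sum_{k} |X_{i,k}|\,|B_{k,j}|
  \le \|X_{i,:}\|_{\infty}\, \|B_{:,j}\|_{1}
```

Under this reading, if every entry of $X$ is bounded in magnitude by some constant, each output element grows by at most a factor of $\|B_{:,j}\|_1$, so roughly $\lceil \log_2 \|B_{:,j}\|_1 \rceil$ additional bits would cover the result of the transformation.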

Figures (16)

Figure 1: (a) Total communication and (b) online communication breakdown on the ResNet-50 building block, profiled with CrypTFlow2 (CTF2) (Rathee et al., 2020; uniform 37-bit) in the first column and SiRNN (Rathee et al., 2021; supporting mixed precision) in the next four columns; (c) weight distribution in regular and Winograd convolution; (d) the ratio of (max - average) to standard deviation, indicating that weight outliers consistently exist after Winograd transformation across different layers.
Figure 2: (a) Flow of OT-based linear layer, e.g., GEMM, including a pre-processing stage to generate input-independent helper data and an online stage to process client's input. (b) An illustration of a GEMM $Y=WX$.
Figure 3: Insertion of bit extension ($\textcolor{Goldenrod}{\bigstar}$) with quantized inference. There is a re-quantization in the GEMM protocol.
Figure 4: Communication breakdown of Winograd convolution on ResNet-32. After the graph-level protocol fusion introduced in the Graph-level Protocol Fusion section, the communication of both the feature and output transformations is reduced.
Figure 5: Example of bit re-weighting with adjusted representation range.
...and 11 more figures

Theorems & Definitions (4)

Lemma
Proposition
Proposition
Proposition
