Flash: A Hybrid Private Inference Protocol for Deep CNNs with High Accuracy and Low Latency on CPU

Hyeri Roh; Jinsu Yeo; Yeongil Ko; Gu-Yeon Wei; David Brooks; Woo-Seok Choi

TL;DR

Flash tackles the bottleneck of private CNN inference by combining a low-latency convolution, built on direct encoding and a fast slot rotation (DRot), with a second-order polynomial activation $f(x)=x^2+x$ that replaces ReLU through gradual layer-by-layer replacement during training to preserve accuracy. It couples HE-based linear computation with a 2PC-based activation protocol that requires no offline communication. The authors report 16–45x lower latency and 84–196x less communication than the state of the art, with end-to-end private inference on CPU taking 0.02 minute for CIFAR-100, 0.57 minute for TinyImageNet, and under 1 minute for ImageNet, with further gains from CPU multi-threading. The work offers a practical path to privacy-preserving CNN inference on conventional hardware: it removes the offline communication phase, reduces noise growth in rotations, and preserves accuracy while cutting latency and bandwidth relative to prior art.
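To make the activation swap concrete, the following is a minimal sketch in plain Python of the polynomial $f(x)=x^2+x$ next to ReLU, together with a toy replacement schedule. The function names and the schedule itself (replacing activations one layer at a time, from the first layer to the last) are assumptions for illustration only; the summary states that replacement is gradual and layer-by-layer but does not specify the order or the training details.

```python
def relu(x: float) -> float:
    """Standard ReLU, shown for comparison."""
    return max(0.0, x)


def poly_act(x: float) -> float:
    """Second-order polynomial activation f(x) = x^2 + x."""
    return x * x + x


def replacement_schedule(num_layers: int) -> list[list[str]]:
    """Hypothetical schedule (assumption, not the paper's exact recipe).

    Stage k uses the polynomial in the first k activation layers and
    ReLU in the remaining ones; stage 0 is the all-ReLU baseline and
    the final stage is the all-polynomial model.
    """
    return [
        ["poly"] * k + ["relu"] * (num_layers - k)
        for k in range(num_layers + 1)
    ]


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.2f}  poly={poly_act(x):.2f}")
    for stage, layers in enumerate(replacement_schedule(3)):
        print(stage, layers)
```

In a real training loop, each stage would presumably be followed by fine-tuning before the next layer is switched, but that step is outside what this summary describes.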

Abstract

This paper presents Flash, an optimized hybrid private inference (PI) protocol that uses both homomorphic encryption (HE) and secure two-party computation (2PC) and reduces the end-to-end PI latency for deep CNN models to less than 1 minute on CPU. First, Flash proposes a low-latency convolution algorithm built upon a fast slot rotation operation and a novel data encoding scheme, which yields a 4-94x performance gain over the state-of-the-art. Second, to minimize the communication cost introduced by the standard nonlinear activation function ReLU, Flash replaces all ReLUs with the polynomial $x^2+x$ and trains deep CNN models with a new training strategy. The trained models improve the inference accuracy for CIFAR-10/100 and TinyImageNet by 16% on average (up to 40% for ResNet-32) compared to prior art. Last, Flash proposes an efficient 2PC-based $x^2+x$ evaluation protocol that requires no offline communication and reduces the total communication cost of the activation layer by 84-196x over the state-of-the-art. As a result, the end-to-end PI latency of Flash implemented on CPU is 0.02 minute for CIFAR-100 and 0.57 minute for TinyImageNet classification, while the total data communication is 0.07GB for CIFAR-100 and 0.22GB for TinyImageNet. Flash improves the state-of-the-art PI by 16-45x in latency and 84-196x in communication cost. Moreover, even for ImageNet, Flash delivers a latency of less than 1 minute on CPU with total communication of less than 1GB.
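For context on why an offline-free activation protocol matters, the sketch below evaluates $x^2+x$ on additively secret-shared values using a Beaver triple, the conventional technique covered in the paper's background sections. This is not Flash's protocol: it is the textbook baseline, in which the triple must be generated and distributed ahead of time, which is exactly the offline cost Flash claims to avoid. The modulus `P`, the trusted-dealer triple generation, and all function names are assumptions for illustration, and both parties are simulated in a single process.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def share(x: int) -> tuple[int, int]:
    """Split x into two additive shares modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P


def reconstruct(s0: int, s1: int) -> int:
    """Recombine two additive shares."""
    return (s0 + s1) % P


def beaver_triple():
    """Trusted-dealer triple (a, b, c = a*b), each value secret-shared.

    In a real deployment this is the offline phase and needs its own
    cryptographic protocol; here it is simulated locally.
    """
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)


def shared_square_plus_x(x_sh: tuple[int, int]) -> tuple[int, int]:
    """Return shares of x^2 + x given shares of x."""
    x0, x1 = x_sh
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # Both multiplication operands are x, so the parties open
    # d = x - a and e = x - b (these reveal nothing about x).
    d = (x0 - a0 + x1 - a1) % P
    e = (x0 - b0 + x1 - b1) % P
    # x*x = c + d*b + e*a + d*e; only party 0 adds the public d*e term.
    z0 = (c0 + d * b0 + e * a0 + d * e) % P
    z1 = (c1 + d * b1 + e * a1) % P
    # Adding x is a local operation on each party's own share.
    return (z0 + x0) % P, (z1 + x1) % P


if __name__ == "__main__":
    out = shared_square_plus_x(share(7))
    print(reconstruct(*out))  # 7*7 + 7 = 56
```

Negative inputs are represented modulo `P`, so sharing `-3 % P` reconstructs to 6, matching $(-3)^2 + (-3)$.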

Paper Structure (30 sections, 10 equations, 11 figures, 10 tables, 2 algorithms)

Introduction
Background
Threat Model
Homomorphic Encryption
Additive Secret Sharing
Garbled Circuits
Beaver's Triples
Existing PI Protocols
Convolution with Direct Encoding
Conventional Convolution
Proposed Slot Rotation over Encrypted Data
Proposed Convolution with DRot
Training with Polynomial Activation
Secure Polynomial Activation Evaluation
Secure Polynomial Evaluation with Existing Techniques
...and 15 more sections

Figures (11)

Figure 1: System design for PI protocol: (a) HE-based, (b) 2PC-based, and (c) overview of Flash with this paper's organization.
Figure 2: Latency comparison between HE operations. Encryption parameters are chosen to compute convolutions in VGG-16 for ImageNet, and HRot latency varies with decomposition base.
Figure 3: Conventional multi-channel convolution. Note that superscripts indicate the order of input channels: (a) single convolution process with two channels packed, and (b) channel-rotation to add partial sums.
Figure 4: Proposed multi-channel convolution. Superscripts indicate the order of input channels and # denotes slots occupied with dummy data: (a) single-channel convolution process with a ciphertext packing two input channels, (b) channel-rotation to add partial sums, and (c) comparison of latency, output ciphertext size, and remaining noise budget between conventional and proposed convolution across various parameter sets ($H\times W$, $c_i$, $c_o$) with 3$\times$3 kernels.
Figure 5: Second-order polynomial approximation of ReLU.
...and 6 more figures
