Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs
Huaming Ling, Ying Wang, Si Chen, Junfeng Fan
TL;DR
The paper tackles two bottlenecks of privacy-preserving CNN inference under CKKS-based FHE: approximating non-linear activations and the limited slot capacity of a ciphertext. It proposes a single-stage fine-tuning (SFT) framework with PolyAct-RN, which enables low-degree ($d=4$) polynomial activations while maintaining accuracy, and a Generalized Interleaved Packing (GIP) scheme that supports virtually arbitrary spatial resolutions through adaptive packing and carefully designed homomorphic operators. Together these yield end-to-end FHE inference across diverse architectures, including ResNet, MobileNet, and YOLOv5, with competitive accuracy on CIFAR-10, ImageNet, and MS COCO, and the first demonstration of FHE-based YOLO inference. The results point toward practical privacy-preserving vision workflows: the reported runtimes are on CPU, and hardware acceleration could reduce them substantially, making secure inference on high-resolution images and complex detectors more feasible.
Abstract
We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing in FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme that is compatible with feature maps of virtually arbitrary spatial resolutions, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. These advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that the FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection using low-degree polynomial activations.
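To make the first challenge concrete: CKKS evaluates only additions and multiplications on ciphertexts, so an activation such as ReLU must be replaced by a polynomial. The sketch below fits a degree-4 polynomial to ReLU by ordinary least squares on a fixed interval and measures the worst-case error on that interval. It is an illustration only, not the paper's PolyAct-RN or SFT procedure: the interval radius of 5, the sample count, and all function names are assumptions made for this example.

```python
def relu(x):
    return max(x, 0.0)


def solve_linear(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_poly_relu(degree=4, radius=5.0, num_points=2001):
    """Least-squares polynomial fit to ReLU on [-radius, radius].

    Returns coefficients c[k] of x**k, lowest degree first.
    The interval and sample count are illustrative assumptions.
    """
    # Fit in the scaled variable t = x / radius so the normal equations stay well conditioned.
    ts = [-1.0 + 2.0 * i / (num_points - 1) for i in range(num_points)]
    size = degree + 1
    gram = [[sum(t ** (i + j) for t in ts) for j in range(size)] for i in range(size)]
    rhs = [sum(t ** i * relu(radius * t) for t in ts) for i in range(size)]
    scaled = solve_linear(gram, rhs)
    # Convert p(t) = sum scaled[k] * t**k back to a polynomial in x = radius * t.
    return [c / radius ** k for k, c in enumerate(scaled)]


def evaluate(coeffs, x):
    """Evaluate the polynomial with Horner's rule (additions and multiplications only)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


coeffs = fit_poly_relu()
grid = [-5.0 + 10.0 * i / 1000 for i in range(1001)]
max_err = max(abs(evaluate(coeffs, x) - relu(x)) for x in grid)
print("coefficients (low to high):", coeffs)
print("max |p(x) - ReLU(x)| on [-5, 5]:", max_err)
```

Because ReLU(x) = x/2 + |x|/2, the fit on a symmetric interval recovers the linear term exactly and spends the even-degree terms on approximating |x|; the largest error sits at the kink at zero. A fixed fit of this kind leaves a visible gap near zero and is only valid inside the chosen interval, which is the sort of degradation that fine-tuning the network with the polynomial activation in place is meant to offset.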
