CiFlow: Dataflow Analysis and Optimization of Key Switching for Homomorphic Encryption
Negar Neda, Austin Ebel, Benedict Reynwar, Brandon Reagen
TL;DR
This work tackles the memory bandwidth and data movement challenges in homomorphic encryption (HE) by rethinking the dataflow of the hybrid key-switching (HKS) step. It introduces three dataflows, Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), and demonstrates that the OC approach greatly increases data reuse, reduces the on-chip working set, and lowers off-chip bandwidth. In an evaluation on a vector HE accelerator (the RPU) across multiple CKKS-parameterized benchmarks, OC achieves up to 4.16x speedup over MP and can reduce on-chip SRAM by up to 12.25x when evaluation keys (evks) are streamed from off-chip memory rather than buffered on-chip, with only modest performance penalties. The results show that careful dataflow design can substantially improve HE practicality by balancing bandwidth and compute, enabling more scalable, memory-efficient hardware solutions for private neural inference and other encrypted workloads.
Abstract
Homomorphic encryption (HE) is a privacy-preserving computation technique that enables computation on encrypted data. Today, the potential of HE remains largely unrealized as it is impractically slow, preventing it from being used in real applications. A major computational bottleneck in HE is the key-switching operation, accounting for approximately 70% of the overall HE execution time and involving a large amount of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to deal with large-scale data. In this paper, we present a novel approach to improve key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with unique scheduling approaches for key-switching computations. Through our analysis, we show how our proposed Output-Centric technique can effectively reuse data by significantly lowering the intermediate key-switching working set and alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring processing algorithms, including those used in HE. This evaluation considers sweeps of bandwidth and computational throughput, and whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show how OC can save 12.25x on-chip SRAM by streaming keys with minimal performance penalty.
