CiFlow: Dataflow Analysis and Optimization of Key Switching for Homomorphic Encryption

Negar Neda, Austin Ebel, Benedict Reynwar, Brandon Reagen

TL;DR

This work tackles the memory bandwidth and data movement challenges in homomorphic encryption (HE) by rethinking the dataflow of the hybrid key-switching (HKS) step. It introduces three dataflows, Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), and demonstrates that OC increases data reuse, shrinks the on-chip working set, and lowers the off-chip bandwidth requirement. Evaluated on a vector HE accelerator (the RPU) across multiple CKKS-parameterized benchmarks, OC achieves up to 4.16x speedup over MP and can reduce on-chip SRAM by up to 12.25x when the evaluation keys (evks) are streamed from off-chip memory, with only a modest performance penalty. The results show that careful dataflow design can make HE more practical by balancing bandwidth and compute, enabling more scalable, memory-efficient hardware for private neural inference and other encrypted workloads.

Abstract

Homomorphic encryption (HE) is a privacy-preserving technique that enables computation on encrypted data. Today, the potential of HE remains largely unrealized because it is impractically slow, which prevents its use in real applications. A major computational bottleneck in HE is the key-switching operation, which accounts for approximately 70% of overall HE execution time and involves a large amount of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to handle large-scale data. In this paper, we present a novel approach to improving key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with a unique approach to scheduling key-switching computations. Through our analysis, we show how our proposed Output-Centric technique can effectively reuse data by significantly shrinking the intermediate key-switching working set and alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring processing algorithms, which include HE. This evaluation considers sweeps of bandwidth and computational throughput, and whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show how OC can save 12.25x on-chip SRAM by streaming keys with minimal performance penalty.

Paper Structure (23 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Hybrid key-switching dataflow diagram for parameters, $\ell=33, \ \alpha=11, \ dnum=3.$
  • Figure 2: High-level $\textit{ModUp}$ timing diagrams for the three proposed dataflows.
  • Figure 3: Microarchitecture of the RPU.
  • Figure 4: Quantifying latency reduction for MP, DC, and OC by increasing DRAM bandwidth for the five given benchmarks.
  • Figure 5: HKS runtime for BTS3 with $\mathbf{evks}$ being off-chip.
  • ...and 4 more figures
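
The parameters quoted in the Figure 1 caption ($\ell=33$, $\alpha=11$, $dnum=3$) are consistent with the usual hybrid key-switching convention, in which the $\ell$ limbs of a ciphertext polynomial are grouped into $dnum$ digits of at most $\alpha$ limbs each. The following is a minimal sketch of that grouping only, assuming the relation $dnum = \lceil \ell / \alpha \rceil$; the function name `split_into_digits` is hypothetical and is not taken from the paper.

```python
from math import ceil


def split_into_digits(num_limbs: int, alpha: int) -> list[list[int]]:
    """Group limb indices 0..num_limbs-1 into digits of at most `alpha` limbs.

    Assumption (not stated in the text above): dnum = ceil(num_limbs / alpha),
    the usual convention for hybrid key switching.
    """
    if num_limbs <= 0 or alpha <= 0:
        raise ValueError("num_limbs and alpha must be positive")
    dnum = ceil(num_limbs / alpha)
    return [
        list(range(d * alpha, min((d + 1) * alpha, num_limbs)))
        for d in range(dnum)
    ]


# Parameters from the Figure 1 caption: ell = 33, alpha = 11.
digits = split_into_digits(33, 11)
print(len(digits))               # 3
print([len(d) for d in digits])  # [11, 11, 11]
```

When $\ell$ is not a multiple of $\alpha$, the last digit simply holds the remaining limbs, so `split_into_digits(30, 11)` yields digit sizes 11, 11, and 8.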