REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption

Aikata Aikata; Ahmet Can Mert; Sunmin Kwon; Maxim Deryabin; Sujoy Sinha Roy

TL;DR

Fully Homomorphic Encryption faces massive computation and memory overhead, challenging practical deployment with monolithic ASIC accelerators. REED proposes a scalable 4-chiplet 2.5D FHE accelerator using a ring-based non-blocking C2C interconnect, a Hybrid NTT, MAS/AUT blocks, and PRNG-based KeySwitch key generation to match large monolithic performance while improving yield and cost. It demonstrates encrypted DNN training benchmarks and reports up to $2{,}991\times$ CPU speedups and $1.9\times$ better performance with roughly half the development cost versus state-of-the-art monolithic designs, thanks to high off-chip bandwidth via HBMs and efficient chiplet collaboration. The work shows that chiplet-based FHE accelerators can make privacy-preserving ML broadly practical, with clear paths to higher throughput and 3D integration in future work.

Abstract

Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All of these prior works place every component needed for FHE on a single (monolithic) chip, which yields high performance. However, they suffer from the practical problems of large-scale chip design, such as inflexibility, low yield, and high manufacturing cost. In this paper, we present REED, the first multi-chiplet FHE accelerator, which overcomes the limitations of prior monolithic designs. To exploit the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism designs to enhance performance. Experimental results demonstrate that the REED 2.5D microprocessor occupies 96.7 mm$^2$ of chip area and consumes 49.4 W of average power in 7 nm technology. It achieves a speedup of up to 2,991x compared to a CPU (24-core 2xIntel X5690) and offers 1.9x better performance, along with a 50% reduction in development costs, compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first benchmark of encrypted deep neural network (DNN) training. Overall, the REED architecture offers a highly effective solution for accelerating FHE, significantly advancing the practicality and deployability of FHE in real-world applications.

Paper Structure (33 sections, 22 figures, 8 tables, 6 algorithms)

Introduction
Background
FHE schemes and CKKS routines
FHE Hardware design goals
Monolithic vs Chiplet packaging
NTT Design Techniques
FHE-tailored Multi-Chiplet Design
REED 2.5D Architecture
Architecture Design of One Chiplet
The Hybrid NTT (Frankenstein's Approach)
Multiply-Add-Subtract (MAS) and Automorphism (AUT)
PRNG-Based Partial Key-Switching Key Generation
Programmable Instruction-Set Architecture
REED Processing Unit (PU)
Throughput Computation for KeySwitch
...and 18 more sections

Figures (22)

Figure 1: Hierarchical (4-step) NTT datapath for $N=16$. DIT stands for Decimation in Time, and DIF stands for Decimation in Frequency.
Figure 2: KeySwitch operation for $l=2$, where I, F, and K represent INTT, NTT, and key multiplication operations using MAS, respectively.
Figure 3: The diagram depicts the different techniques, data, and task distribution for automorphism followed by KeySwitch. $l=2,1$ for (a), (c) and $4$ for (b)
Figure 4: Side and top view of proposed four chiplet-based REED 2.5D.
Figure 5: Novel routing-friendly Hybrid NTT/INTT design flow for $N=N_1 \times N_2$.
...and 17 more figures
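
The hierarchical (4-step) NTT named in the Figure 1 caption can be illustrated with a minimal sketch for $N=16$ split as $N_1 \times N_2 = 4 \times 4$. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the toy modulus $q=17$, the root of unity $\omega=3$, the function names, the use of a plain cyclic NTT, and the naive $O(n^2)$ sub-transforms standing in for the DIT/DIF butterfly stages of a hardware datapath.

```python
def naive_ntt(x, w, q):
    """Reference cyclic NTT: X[k] = sum_j x[j] * w^(j*k) mod q."""
    n = len(x)
    return [sum(x[j] * pow(w, j * k, q) for j in range(n)) % q
            for k in range(n)]


def four_step_ntt(x, n1, n2, w, q):
    """Cyclic NTT of length n1*n2 computed as small NTTs plus twiddles.

    Input index is n2*i1 + i2, output index is k1 + n1*k2.
    """
    n = n1 * n2
    assert len(x) == n
    w1 = pow(w, n2, q)  # primitive n1-th root of unity
    w2 = pow(w, n1, q)  # primitive n2-th root of unity

    # Step 1: n2 column NTTs of length n1 (column c is x[c], x[c+n2], ...).
    cols = [naive_ntt(x[c::n2], w1, q) for c in range(n2)]

    # Step 2: element-wise twiddle multiplication by w^(c*k1).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] = cols[c][k1] * pow(w, c * k1, q) % q

    # Step 3: n1 row NTTs of length n2.
    # Step 4: transposed write-back to output index k1 + n1*k2.
    out = [0] * n
    for k1 in range(n1):
        row = naive_ntt([cols[c][k1] for c in range(n2)], w2, q)
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


# Toy parameters (assumed): 3 has multiplicative order 16 modulo 17.
Q, W = 17, 3
data = list(range(16))
result = four_step_ntt(data, 4, 4, W, Q)
reference = naive_ntt(data, W, Q)
```

With these toy parameters, `result` can be compared against `reference` to check that the decomposition reproduces the direct transform. The point of the split is that each small transform touches only $N_1$ or $N_2$ coefficients at a time.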
