GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Kaustubh Shivdikar; Yuhui Bao; Rashmi Agrawal; Michael Shen; Gilbert Jonatan; Evelio Mora; Alexander Ingare; Neal Livesay; José L. Abellán; John Kim; Ajay Joshi; David Kaeli

TL;DR

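Fully Homomorphic Encryption (FHE) enables computation on encrypted data but can run up to five orders of magnitude slower than the same computation on plaintext. The paper introduces GME, a microarchitectural co-design for the CKKS scheme on AMD CDNA GPUs. GME adds three hardware extensions: cNoC, a CU-side hierarchical interconnect that keeps ciphertext in cache across FHE kernels; MOD units, which provide native modular reduction; and WMAC units, pipelined $64$-bit integer arithmetic cores. It pairs these with LABS, a compile-time Locality-Aware Block Scheduler. Evaluated with the NaviSim and BlockSim simulators, GME achieves average speedups of $796\times$, $14.2\times$, and $2.3\times$ over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively, and reduces memory traffic through data locality and native modular arithmetic. The work demonstrates a practical path to scaling privacy-preserving workloads in the cloud by leveraging existing GPU ecosystems, and provides a design-space exploration toolchain to guide future microarchitecture choices.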

Abstract

Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined $64$-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by $19\%$. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of $796\times$, $14.2\times$, and $2.3\times$ over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.
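To make concrete why modular reduction is a natural target for dedicated hardware, here is a minimal software sketch of a division-free modular multiply using Barrett reduction. This is an illustration only: the abstract does not say which reduction algorithm the MOD-units implement, so Barrett reduction is swapped in as a common textbook technique. The function names, the word size `k`, and the modulus used in the usage note are all assumptions made for this example.

```python
# Illustrative sketch only: Barrett reduction is a common way to avoid a
# hardware divide in modular arithmetic. It is NOT claimed to be the
# algorithm used by GME's MOD-units; all names here are hypothetical.

def barrett_precompute(q: int, k: int = 64) -> int:
    """Return mu = floor(2^(2k) / q) for a modulus q < 2^k (computed once per modulus)."""
    if not 0 < q < (1 << k):
        raise ValueError("modulus must satisfy 0 < q < 2^k")
    return (1 << (2 * k)) // q


def barrett_reduce(x: int, q: int, mu: int, k: int = 64) -> int:
    """Reduce x modulo q for 0 <= x < q^2 using multiplies, shifts, and subtractions."""
    t = (x * mu) >> (2 * k)  # estimate of floor(x / q); never an overestimate
    r = x - t * q            # candidate remainder, lies in [0, 2q)
    while r >= q:            # correction step; runs at most once for valid inputs
        r -= q
    return r


def mod_mul(a: int, b: int, q: int, mu: int, k: int = 64) -> int:
    """Multiply two residues a, b in [0, q) and reduce the product modulo q."""
    return barrett_reduce(a * b, q, mu, k)
```

As a usage example, with the illustrative modulus `q = 2**61 - 1` and `mu = barrett_precompute(q)`, `mod_mul(a, b, q, mu)` agrees with `(a * b) % q` for any residues `a` and `b` below `q`. Each modular multiply costs several wide integer multiplies plus a conditional subtraction when built from general-purpose instructions, which gives a sense of why a native unit for this operation can pay off.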

Paper Structure (15 sections, 3 equations, 8 figures, 9 tables)


Introduction
Background
AMD CDNA Architecture
CKKS FHE Scheme
GME Architecture
cNoC: CU-side interconnect
Enhancing the Vector ALU
LABS: Locality-Aware Block Scheduler
Evaluation
The NaviSim and BlockSim Simulators
Experimental Setup
Results
Discussion
Related Work
Conclusion

Figures (8)

Figure 1: FHE offers a safeguard against online eavesdroppers as well as untrusted cloud services by allowing direct computation on encrypted data.
Figure 2: The four key contributions of our work (indicated in green) evaluated within the context of an AMD CDNA GPU architecture.
Figure 3: Architecture diagram showing the limitations of AMD GPU memory hierarchy. Each compute unit has a dedicated L1V cache and an LDS unit that cannot be shared with neighboring compute units.
Figure 4: Inter-CU communication: Traditional vs proposed communication with on-chip network
Figure 5: Proposed hierarchical on-chip network featuring a concentrated 2D torus topology
...and 3 more figures
