FHECore: Rethinking GPU Microarchitecture for Fully Homomorphic Encryption

Lohit Daksha; Seyda Guzelhan; Kaustubh Shivdikar; Carlos Agulló Domingo; Óscar Vera Lopez; Gilbert Jonatan; Hubert Dymarkowski; Aymane El Jerari; José Cano; José L. Abellán; John Kim; David Kaeli; Ajay Joshi

TL;DR

The proposed FHECore is a specialized functional unit integrated directly into the GPU's Streaming Multiprocessor, motivated by a key insight: the two dominant contributors to latency, Number Theoretic Transform and Base Conversion, can be formulated as modulo-linear transformations.

Abstract

Fully Homomorphic Encryption (FHE) enables computation directly on encrypted data but incurs massive computational and memory overheads, often exceeding plaintext execution by several orders of magnitude. While custom ASIC accelerators can mitigate these costs, their long time-to-market and the rapid evolution of FHE algorithms threaten their long-term relevance. GPUs, by contrast, offer scalability, programmability, and widespread availability, making them an attractive platform for FHE. However, modern GPUs are increasingly specialized for machine learning workloads, emphasizing low-precision datatypes (e.g., INT8, FP8) that are fundamentally mismatched to the wide-precision modulo arithmetic required by FHE. In other words, while GPUs offer ample parallelism, their functional units, such as Tensor Cores, are not suited to the wide-integer modulo arithmetic required by FHE schemes such as CKKS. Despite this constraint, researchers have attempted to map FHE primitives onto Tensor Cores by segmenting wide integers into low-precision (INT8) chunks. To overcome these bottlenecks, we propose FHECore, a specialized functional unit integrated directly into the GPU's Streaming Multiprocessor. Our design is motivated by a key insight: the two dominant contributors to latency, Number Theoretic Transform and Base Conversion, can be formulated as modulo-linear transformations. This allows them to be mapped onto a common hardware unit that natively supports wide-precision modulo-multiply-accumulate operations. Our simulations demonstrate that FHECore reduces dynamic instruction count by a geometric mean of $2.41\times$ for CKKS primitives and $1.96\times$ for end-to-end workloads. These reductions translate to performance speedups of $1.57\times$ and $2.12\times$, respectively, including a $50\%$ reduction in bootstrapping latency, all while incurring a modest $2.4\%$ area overhead.
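To make the "modulo-linear transformation" claim concrete, the following Python sketch writes both a Number Theoretic Transform and a fast base conversion as a single matrix-vector product with per-row modular reduction. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`mod_matvec`, `power_matrix`, `base_conv`), the toy modulus 17, the length-8 cyclic transform (CKKS typically uses a negacyclic variant), and the small RNS bases are all hypothetical choices made for readability.

```python
# Toy illustration: NTT and fast base conversion both reduce to a
# matrix-vector product with per-row modular reduction.
# All parameters are hypothetical and far smaller than real CKKS ones.

def mod_matvec(matrix, vec, moduli):
    """Return [sum_j matrix[i][j] * vec[j] mod moduli[i]] for each row i."""
    return [sum(a * x for a, x in zip(row, vec)) % m
            for row, m in zip(matrix, moduli)]

# --- NTT (cyclic form, chosen for simplicity) ---
Q = 17      # toy prime modulus
N = 8       # transform length
OMEGA = 2   # primitive 8th root of unity mod 17 (2**4 = 16 = -1 mod 17)

def power_matrix(root, n, q):
    """Matrix W with W[k][j] = root**(j*k) mod q."""
    return [[pow(root, j * k, q) for j in range(n)] for k in range(n)]

def ntt(x):
    """Forward transform: one modulo matrix-vector product."""
    return mod_matvec(power_matrix(OMEGA, N, Q), x, [Q] * N)

def intt(y):
    """Inverse transform: same structure with the inverse root, scaled by 1/N."""
    omega_inv = pow(OMEGA, -1, Q)   # modular inverse (Python 3.8+)
    n_inv = pow(N, -1, Q)
    x = mod_matvec(power_matrix(omega_inv, N, Q), y, [Q] * N)
    return [(n_inv * v) % Q for v in x]

# --- Fast base conversion between two RNS bases ---
def base_conv(residues, src, dst):
    """Convert residues modulo `src` primes to residues modulo `dst` primes.

    This is the fast (approximate) variant: the result represents
    x + u * prod(src) for some small integer u with 0 <= u < len(src).
    """
    big_q = 1
    for q in src:
        big_q *= q
    q_hat = [big_q // q for q in src]               # prod(src) / q_i
    # Pre-scale each residue by the inverse of q_hat_i modulo q_i.
    scaled = [(r * pow(h % q, -1, q)) % q
              for r, h, q in zip(residues, q_hat, src)]
    # The conversion itself is a matrix-vector product: row j is reduced mod dst[j].
    matrix = [[h % p for h in q_hat] for p in dst]
    return mod_matvec(matrix, scaled, dst)

# Round trip through the toy NTT recovers the input.
x = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(x)) == x
```

Both `ntt` and `base_conv` spend their time inside `mod_matvec`, which is the shape of computation a unit supporting wide-precision modulo-multiply-accumulate operations could execute directly. Production libraries use fast butterfly-based NTTs rather than the dense matrix shown here; the dense form is used only to expose the linear structure.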

Paper Structure (30 sections, 5 equations, 10 figures, 10 tables, 1 algorithm)


Introduction
Background
FHE Operations and Basics of CKKS
Number Theoretic Transform (NTT)
Base Conversion
NVIDIA A100 GPU Architecture
Motivation
Growing Mismatch Between GPU Datatypes and FHE Requirements
The Need for Coarser-Grained Execution in FHE
GPUs provide the most realistic path for FHE acceleration
FHECore
Overall System Architecture
Integration of FHECore
Microarchitecture
Dataflow Analysis
...and 15 more sections

Figures (10)

Figure 1: Latency decomposition of CKKS-based workloads on the A100 GPU. Here, FIDESlib resolves the memory-boundedness of CKKS workloads using established prior work [mad], after which we shift our focus to compute performance. Across bootstrapping, logistic regression (LR), ResNet20, and BERT-Tiny, the NTT, INTT, and BaseConv steps together account for more than $70\%$ of total runtime, identifying them as the dominant compute bottlenecks.
Figure 2: Compilation flow for the Ampere architecture Tensor Core's INT8 operations. High-level CUDA C++ APIs using mma intrinsics are first lowered to PTX (vISA) instructions, which are further lowered to SASS (mISA) instructions. The green highlighted sections in the middle represent instructions that are executed by Tensor Cores, while the remaining instructions are handled by the load/store (LD/ST) units to move data within the register file and across the memory hierarchy.
Figure 3: FHECore (FC) is introduced as a functional unit alongside the existing CUDA cores and tensor cores (TC) within each Streaming Multiprocessor. Like TC, it operates on values stored in the register file and does not directly interact with the caches or main memory. The FHECore consists of a 2D systolic array of processing elements (PEs), each supporting modulo multiply-and-accumulate operations. The PE depicted in this figure supports the output-stationary dataflow, where partial products are accumulated within an internal register. The modulo reduction step is handled by a dedicated Barrett reduction pipeline. Here, $\mu$ is the precomputed constant and $q$ is the modulus.
Figure 4: Traversal of data in operand-stationary vs. output-stationary dataflows on a miniaturized example $4\times4$ systolic array of FHECore PEs. Here, the red and blue inputs of each PE correspond to either entries or partial-sums of the matrices. Numbers inside each PE denote the cycle in which the PE receives red and blue operands for the first time. In the operand-stationary dataflow, only the blue operand advances each cycle, while the red operand must traverse the entire $6$-stage PE pipeline before forwarding a partial sum to the PE vertically below. In contrast, the output-stationary dataflow forwards both operands (red and blue) every cycle, allowing for a significantly faster modulo matrix multiplication.
Figure 5: Interface of FHECore with the GPU memory hierarchy. FHECore (FC) shares the same register file and memory hierarchy as CUDA and tensor cores (TC). Like tensor cores, FHECore interacts solely with the register file, fetching operands and writing results via register ports. This design also allows future memory-system enhancements, such as distributed shared memory (DSMEM) introduced in the Hopper architecture, to be utilized by FHECore.
...and 5 more figures
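
The Figure 3 caption describes a PE that performs modulo multiply-and-accumulate with Barrett reduction, using a precomputed constant $\mu$ and modulus $q$, and that keeps its partial sum in an internal register (output-stationary). The Python sketch below models that behavior in software. It is a hedged illustration only: the names `barrett_mu`, `barrett_reduce`, `ModMacPE`, and `mod_dot`, the bit-width parameter `k`, the choice to reduce after every multiply, and the toy modulus 65537 are assumptions, and the sketch says nothing about the actual pipeline stages or word size of the hardware.

```python
# Software model of a modulo multiply-accumulate element with Barrett reduction.
# Hypothetical names and toy parameters; not the paper's hardware design.

def barrett_mu(q, k):
    """Precompute mu = floor(2**(2k) / q) for a modulus q < 2**k."""
    return (1 << (2 * k)) // q

def barrett_reduce(x, q, mu, k):
    """Reduce 0 <= x < q*q modulo q without a division.

    t approximates floor(x / q) from below, so the remainder estimate r
    is non-negative and only needs a small conditional correction.
    """
    t = (x * mu) >> (2 * k)
    r = x - t * q
    while r >= q:       # correction step; the estimate is off by a small amount
        r -= q
    return r

class ModMacPE:
    """Output-stationary processing element: the accumulator stays in place."""

    def __init__(self, q, k):
        self.q, self.k = q, k
        self.mu = barrett_mu(q, k)
        self.acc = 0    # internal partial-sum register

    def step(self, a, b):
        """Accumulate a*b mod q into the internal register (requires a, b < q)."""
        prod = barrett_reduce(a * b, self.q, self.mu, self.k)
        s = self.acc + prod                     # s < 2q, one subtraction suffices
        self.acc = s - self.q if s >= self.q else s
        return self.acc

def mod_dot(a_vec, b_vec, q, k):
    """Dot product mod q, streamed through a single PE."""
    pe = ModMacPE(q, k)
    for a, b in zip(a_vec, b_vec):
        pe.step(a, b)
    return pe.acc

# Small usage example: 1*4 + 2*5 + 3*6 = 32, which is already below the modulus.
assert mod_dot([1, 2, 3], [4, 5, 6], 65537, 17) == 32
```

In a systolic array, one such element would sit at each output position and receive a new operand pair every cycle, so an entire row-by-column dot product accumulates in place before the result is written back to the register file.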
