Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration

Wonseok Choi, Hyunah Yu, Jongmin Kim, Hyesung Ji, Jaiyoung Park, Jung Ho Ahn

TL;DR

The paper conducts a microarchitectural analysis of CKKS bootstrapping on modern GPUs, revealing that memory bandwidth and L2-cache capacity, not arithmetic throughput, dominate performance. It introduces Theodosian, a set of memory-hierarchy-aware optimizations (L2-aware batching, complementary pipelining, and CUDA Graphs) that improve CKKS throughput and bootstrapping latency, achieving 1.45–1.83x speedups on RTX 5090 and setting new state-of-the-art results. The study also shows that even with large L2 caches, the memory wall persists, and outlines remaining headroom and future directions toward memory-aware cryptographic algorithms and hardware designs. Overall, the work provides a practical pathway to accelerate FHE on GPUs while highlighting fundamental hardware constraints.

Abstract

Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior, and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further discover that the overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2ms with Theodosian, and further to 12.8ms with additional algorithmic optimizations, establishing new state-of-the-art GPU performance to the best of our knowledge.
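The claim that the dominant kernels are bound by memory bandwidth rather than arithmetic can be read through a roofline model. The following is a minimal sketch of that reasoning, not the paper's methodology: the peak throughput, peak bandwidth, and per-kernel arithmetic intensities are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def attainable_throughput(intensity, peak_ops, peak_bw):
    """Roofline bound in ops/s for a kernel with the given arithmetic
    intensity (ops per byte moved), under a compute roof (ops/s) and a
    bandwidth roof (bytes/s)."""
    return min(peak_ops, peak_bw * intensity)


def is_memory_bound(intensity, peak_ops, peak_bw):
    """A kernel is memory-bound when its intensity lies left of the ridge
    point peak_ops / peak_bw."""
    return intensity < peak_ops / peak_bw


# Hypothetical placeholder values (assumptions, not figures from the paper).
PEAK_OPS = 40e12  # integer ops per second
PEAK_BW = 1.5e12  # bytes per second

for name, intensity in [("low-intensity kernel", 4.0),
                        ("high-intensity kernel", 64.0)]:
    bound = "memory" if is_memory_bound(intensity, PEAK_OPS, PEAK_BW) else "compute"
    roof = attainable_throughput(intensity, PEAK_OPS, PEAK_BW)
    print(f"{name}: {bound}-bound, attainable {roof:.2e} ops/s")
```

In this model, raising a memory-bound kernel's effective intensity (for example, by reusing data that stays on chip) moves it toward the ridge point, whereas adding arithmetic throughput alone does not help.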

Paper Structure

This paper contains 29 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparing (a) L2 cache capacity, (b) empirical memory bandwidth, and (c) empirical 32-bit integer multiply-and-add (IMAD) throughput across various NVIDIA GPUs. L2 bandwidth is measured by repeatedly reading and writing a data block smaller than the L2 cache size to saturate its bandwidth.
  • Figure 2: (a) Utilization rates of ALU, FMA, SM-to-L2 NoC, L2-to-SM NoC, and aggregated L2 cache bandwidth for NTT kernels with $L$ varying from 10 to 250, and for BConv kernels with $L_{\mathrm{in}}=12$ and $L_{\mathrm{out}}$ varying from 12 to 264. We look beyond the typical $L$ range for a single polynomial ($L<64$), which will be made clear in §\ref{sec:opt:parallel}. We compare the original BConv kernel in Cheddar [asplos-2026-cheddar] with our optimized one. (b) FMA instruction breakdown into core (defined in §\ref{sec:boot:motivation}) and non-core operations for the kernels, where BConv(C) and BConv(O) denote Cheddar’s original BConv and our optimized BConv, respectively. (c) Kernel execution time changes when the core operations are removed for $L = 48$ (NTT) and $(L_{\mathrm{out}}, L_{\mathrm{in}})=(48, 12)$ (BConv). An RTX 5090 is used for the analysis.
  • Figure 3: (Left) Roofline analysis of bootstrapping (Boot) and polynomial operations based on Cheddar [asplos-2026-cheddar]. (Right) Time, global memory access, and core IMAD operation breakdown for bootstrapping. An RTX 5090 is used for the analysis. In addition to the operations, Boot includes kernel launch overhead.
  • Figure 4: SM-to-L2 NoC utilization, FMA utilization, and the amount of DRAM transfers in NTT1, NTT2, and BConv kernels for batch sizes ranging from 1 to 16. Tested parameters are $L=24, 48$ for NTT and $(L_{\mathrm{out}},L_{\mathrm{in}})=(24,12),(48,12)$ for BConv. An RTX 5090 is used for the analysis.
  • Figure 5: How the operational sequence of key-switching changes when multi-polynomial caching and complementary pipelining are applied. $(C, D)$ denotes $C$ polynomials each with $D$ limbs. Typical values for the parameters in the figure are $L=48$, $\alpha=12$, and $\beta=L/\alpha=4$.
  • ...and 3 more figures
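
The parameters in the captions for Figures 4 and 5 allow a back-of-envelope check of how many polynomials can share the L2 cache at once. The sketch below is illustrative only and is not the paper's batching algorithm. It assumes a ring degree of $N = 2^{16}$ (consistent with 32,768 complex slots under full packing), 32-bit limb words, and a hypothetical L2 capacity of 96 MiB; the function names are invented for this example.

```python
def poly_bytes(n, limbs, word_bytes=4):
    """Footprint of one RNS polynomial: n coefficients per limb,
    `limbs` limbs, `word_bytes` bytes per coefficient (assumed 32-bit)."""
    return n * limbs * word_bytes


def max_batch_in_l2(l2_bytes, n, limbs, word_bytes=4):
    """Largest number of whole polynomials whose combined footprint
    fits within l2_bytes (ignores temporaries and other cache users)."""
    return l2_bytes // poly_bytes(n, limbs, word_bytes)


def num_digits(limbs, alpha):
    """Key-switching digit count, beta = ceil(L / alpha)."""
    return -(-limbs // alpha)


N = 1 << 16                 # assumed ring degree (32,768 slots = N / 2)
L2_BYTES = 96 * 1024 ** 2   # hypothetical L2 capacity, an assumption

for limbs in (24, 48):      # the L values listed for Figure 4
    mib = poly_bytes(N, limbs) / 1024 ** 2
    batch = max_batch_in_l2(L2_BYTES, N, limbs)
    print(f"L={limbs}: {mib:.0f} MiB per polynomial, up to {batch} fit in L2")

print("beta =", num_digits(48, 12))  # L=48, alpha=12 as in Figure 5
```

Under these assumptions a single polynomial already occupies several MiB, so the number of polynomials that can be processed together without spilling to DRAM is small and shrinks as $L$ grows.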