FlexMem: High-Parallel Near-Memory Architecture for Flexible Dataflow in Fully Homomorphic Encryption

Shangyi Shi, Husheng Han, Jianan Mu, Xinyao Zheng, Ling Liang, Hang Lu, Zidong Du, Xiaowei Li, Xing Hu, Qi Guo

TL;DR

FlexMem addresses the memory bottleneck in Fully Homomorphic Encryption (FHE) by placing highly parallel, homogeneous processing elements (PEs) near DRAM subarrays and enabling hierarchical, stride-aware data movement. The architecture supports polynomial- and ciphertext-level dataflows, with remapping to keep near-memory bandwidth utilization close to full across CKKS, TFHE, and hybrid workloads. Key innovations include bank- and chip-level interconnects tailored to NTT and bootstrapping access patterns, plus in-memory dataflow management that minimizes host CPU involvement. Experimental results show a 1.12× performance improvement over state-of-the-art near-memory architectures at 95.7% near-memory bandwidth utilization, with speedups on both CKKS and TFHE benchmarks, supporting FlexMem’s scalable near-memory design for secure outsourced computation.

Abstract

Fully Homomorphic Encryption (FHE) imposes substantial memory bandwidth demands, presenting significant challenges for efficient hardware acceleration. Near-memory processing (NMP) has emerged as a promising architectural solution to alleviate the memory bottleneck. However, the irregular memory access patterns and flexible dataflows inherent to FHE limit the effectiveness of existing NMP accelerators, which fail to fully utilize the available near-memory bandwidth. In this work, we propose FlexMem, a near-memory accelerator featuring highly parallel computational units with varying memory access strides and interconnect topologies to handle irregular memory access patterns effectively. Furthermore, we design polynomial- and ciphertext-level dataflows that use near-memory bandwidth efficiently under varying degrees of polynomial parallelism and improve parallel performance. Experimental results demonstrate that FlexMem achieves a 1.12× performance improvement over state-of-the-art near-memory architectures, with 95.7% near-memory bandwidth utilization.

Paper Structure

This paper contains 32 sections, 2 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: FHE exhibits varying memory access strides within subarray rows (a) or across banks (b). Different strides lead to different pipelines and timing efficiency. Example DRAM timing parameters: tACT (row activation, 12 ns), tWR (row write-back, 8 ns), tRCD (row-to-column delay, 12 ns), tCCD (column-to-column delay, 4 ns), tPRE (row precharge, 12 ns).
  • Figure 2: Ciphertext-level computation process in CKKS.
  • Figure 3: Logical organization inside a chip.
  • Figure 4: Overall architecture. PEs are integrated between each subarray pair. Parallel pathways among banks are constructed for NTT inter-stage coefficient switching. Chip-level network is proposed for ciphertext-level data transfer.
  • Figure 5: Pipeline between adjacent subarrays.
  • ...and 7 more figures
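
The varying strides mentioned in the Figure 1 caption can be illustrated with a small sketch. It relies only on the standard property of an in-place radix-2 NTT: at each stage, a butterfly pairs coefficient i with coefficient i + stride, and the stride takes every power of two below the polynomial length (in ascending or descending order depending on the variant). The function names, the `row_coeffs` parameter, and the contiguous row layout are assumptions made for illustration; they are not FlexMem's actual data mapping.

```python
def butterfly_strides(n):
    """Partner distances for the log2(n) stages of an in-place radix-2 NTT.

    Listed in ascending order; a decimation-in-frequency variant visits
    the same strides in the reverse order.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    strides = []
    stride = 1
    while stride < n:
        strides.append(stride)
        stride *= 2
    return strides


def classify_stages(n, row_coeffs):
    """Label each stage by whether both butterfly operands share a row.

    Assumption (hypothetical layout): coefficients are stored contiguously,
    row_coeffs per row, with row_coeffs a power of two. Under that layout
    a butterfly stays inside one row exactly when stride < row_coeffs.
    """
    if row_coeffs < 1 or row_coeffs & (row_coeffs - 1):
        raise ValueError("row_coeffs must be a power of two >= 1")
    return [
        (stride, "same-row" if stride < row_coeffs else "cross-row")
        for stride in butterfly_strides(n)
    ]
```

For instance, `classify_stages(16, 4)` labels strides 1 and 2 as same-row and strides 4 and 8 as cross-row. In this toy layout, only the early stages can be served from a single open row, while later stages must fetch operands from different rows; this is the kind of stride-dependent access pattern the caption refers to.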