FlexMem: High-Parallel Near-Memory Architecture for Flexible Dataflow in Fully Homomorphic Encryption
Shangyi Shi, Husheng Han, Jianan Mu, Xinyao Zheng, Ling Liang, Hang Lu, Zidong Du, Xiaowei Li, Xing Hu, Qi Guo
TL;DR
FlexMem addresses the memory bottleneck in Fully Homomorphic Encryption (FHE) by placing highly parallel, homogeneous processing elements (PEs) near DRAM subarrays and enabling hierarchical, stride-aware data movement. The architecture supports polynomial-level and ciphertext-level dataflows, with remapping to keep near-memory bandwidth utilization close to full across CKKS, TFHE, and hybrid workloads. Key innovations include bank-level and chip-level interconnects tailored to NTT and bootstrapping access patterns, plus in-memory dataflow management that minimizes host CPU involvement. Experimental results show about 95.7% near-memory bandwidth utilization and speedups over state-of-the-art accelerators on CKKS and TFHE benchmarks, supporting FlexMem's scalable near-memory design for secure outsourced computation.
Abstract
Fully Homomorphic Encryption (FHE) imposes substantial memory bandwidth demands, presenting significant challenges for efficient hardware acceleration. Near-memory processing (NMP) has emerged as a promising architectural solution to alleviate the memory bottleneck. However, the irregular memory access patterns and flexible dataflows inherent to FHE limit the effectiveness of existing NMP accelerators, which fail to fully utilize the available near-memory bandwidth. In this work, we propose FlexMem, a near-memory accelerator featuring highly parallel computational units that support varying memory access strides and interconnect topologies to handle irregular memory access patterns effectively. Furthermore, we design polynomial-level and ciphertext-level dataflows that use near-memory bandwidth efficiently under varying degrees of polynomial parallelism and improve parallel performance. Experimental results demonstrate that FlexMem achieves a 1.12× performance improvement over state-of-the-art near-memory architectures, with 95.7% near-memory bandwidth utilization.
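
To make the "varying memory access strides" concrete, the following is a minimal sketch of a textbook iterative radix-2 number-theoretic transform (NTT), the polynomial kernel common in FHE schemes. It is not FlexMem's implementation; the function names and the toy parameters (modulus q = 17, root of unity omega = 2, length 8) are assumptions chosen only for illustration. The sketch records the distance between the two operands of each butterfly, which doubles at every stage, so no single fixed data layout keeps all stages local to one memory region.

```python
def bit_reverse_permute(values):
    """Reorder values so that index i moves to the bit-reversal of i."""
    n = len(values)
    bits = n.bit_length() - 1  # n is assumed to be a power of two, n >= 2
    out = [0] * n
    for i, v in enumerate(values):
        rev = int(format(i, f"0{bits}b")[::-1], 2)
        out[rev] = v
    return out


def ntt_with_strides(coeffs, omega, q):
    """Iterative Cooley-Tukey NTT mod q; also returns the stride of each stage.

    omega must be a primitive n-th root of unity mod q (illustrative assumption).
    """
    a = bit_reverse_permute(coeffs)
    n = len(a)
    strides = []
    length = 2
    while length <= n:
        half = length // 2
        strides.append(half)  # distance between the two butterfly operands
        w_len = pow(omega, n // length, q)  # primitive length-th root of unity
        for start in range(0, n, length):
            w = 1
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w % q
                a[start + j] = (u + v) % q
                a[start + j + half] = (u - v) % q
                w = w * w_len % q
        length *= 2
    return a, strides


def naive_ntt(coeffs, omega, q):
    """Direct O(n^2) evaluation, used only to check the fast version."""
    n = len(coeffs)
    return [
        sum(c * pow(omega, i * k, q) for i, c in enumerate(coeffs)) % q
        for k in range(n)
    ]


# Toy parameters (assumed): 2 has multiplicative order 8 modulo 17.
Q, OMEGA = 17, 2
poly = [1, 2, 3, 4, 5, 6, 7, 8]
result, stage_strides = ntt_with_strides(poly, OMEGA, Q)
```

Here `stage_strides` holds one entry per butterfly stage, and `result` can be compared against `naive_ntt(poly, OMEGA, Q)` as a correctness check. In a real FHE workload the polynomial length is in the thousands or more, so the later stages pair elements that may reside in different subarrays or banks, which is the kind of access pattern the abstract describes as irregular.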
