FLASH-FHE: A Heterogeneous Architecture for Fully Homomorphic Encryption Acceleration
Junxue Zhang, Xiaodian Cheng, Gang Cao, Meng Dai, Yijun Sun, Han Tian, Dian Shen, Yong Wang, Kai Chen
TL;DR
Real-world FHE workloads are mixed: deep tasks require bootstrapping and large cryptographic parameters $N$ and $L$, while shallow tasks benefit from small $N$ and $L$, and homogeneous accelerators cannot serve both well. FLASH-FHE introduces a heterogeneous architecture with bootstrappable and swift clusters organized into cluster affiliations, plus a shared on-chip cache and a multi-level transpose, guided by a scheduler that optimizes both parallelism and data reuse. The design is implemented in RTL and synthesized at 7nm and 14/12nm technology nodes. Evaluation shows average improvements of $1.4\times$ over CraterLake and $11.2\times$ over F1 for deep workloads, and up to $8.0\times$ speedup for shallow workloads. This makes hardware acceleration of mixed FHE workloads more practical for privacy-preserving computation.
Abstract
Many hardware accelerators have recently been proposed to address the inefficiency of fully homomorphic encryption (FHE) schemes. However, none of them delivers optimal performance on real-world FHE workloads, which mix shallow and deep computations, primarily because of their homogeneous design principle. This paper presents FLASH-FHE, the first FHE accelerator with a heterogeneous architecture for mixed workloads. At its heart, FLASH-FHE designs two types of computation clusters, i.e., bootstrappable and swift, which are optimized for deep and shallow workloads, respectively, in terms of cryptographic parameters and hardware pipelines. We organize one bootstrappable and two swift clusters into one cluster affiliation and present a scheduling scheme with two modes. For deep FHE workloads, it provides sufficient acceleration by utilizing all the affiliations. For shallow FHE workloads, it improves parallelism by assigning one shallow workload per affiliation and dynamically decomposing the bootstrappable cluster into multiple swift pipelines to accelerate the assigned workload. We further show that these two types of clusters can share valuable on-chip memory, improving performance without significant resource consumption. We implement FLASH-FHE in RTL and synthesize it using both 7nm and 14/12nm technology nodes. Our experimental results demonstrate that FLASH-FHE achieves average performance improvements of $1.4\times$ and $11.2\times$ over the state-of-the-art FHE accelerators CraterLake and F1, respectively, for deep workloads, while delivering up to $8.0\times$ speedup for shallow workloads due to its heterogeneous architecture.
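The two scheduling modes described in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the scheduler implemented in FLASH-FHE: the names `Workload`, `Step`, and `schedule` are hypothetical, the number of swift pipelines obtained by decomposing a bootstrappable cluster (`PIPELINES_PER_BOOTSTRAPPABLE`) is a placeholder value that the abstract does not give, and serving workloads in arrival order is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

SWIFT_PER_AFFILIATION = 2         # stated in the abstract: two swift clusters per affiliation
PIPELINES_PER_BOOTSTRAPPABLE = 4  # ASSUMPTION: placeholder, not given in the abstract


@dataclass
class Workload:
    name: str
    deep: bool  # True if the workload needs bootstrapping


@dataclass
class Step:
    mode: str                         # "deep" or "shallow"
    assignment: Dict[str, List[int]]  # workload name -> affiliation ids it occupies


def swift_pipelines_per_affiliation() -> int:
    """Swift pipelines usable by one shallow workload once the
    bootstrappable cluster of its affiliation has been decomposed."""
    return SWIFT_PER_AFFILIATION + PIPELINES_PER_BOOTSTRAPPABLE


def schedule(workloads: List[Workload], num_affiliations: int) -> List[Step]:
    """Hypothetical arrival-order policy:
    - a deep workload occupies every affiliation for one step;
    - shallow workloads are batched, one per affiliation, per step."""
    if num_affiliations < 1:
        raise ValueError("need at least one affiliation")

    steps: List[Step] = []
    batch: List[Workload] = []  # shallow workloads waiting for a step

    def flush() -> None:
        # Emit one step that runs each batched shallow workload
        # on its own affiliation, then empty the batch.
        if batch:
            steps.append(Step("shallow", {w.name: [i] for i, w in enumerate(batch)}))
            batch.clear()

    for w in workloads:
        if w.deep:
            flush()  # finish pending shallow work before taking all affiliations
            steps.append(Step("deep", {w.name: list(range(num_affiliations))}))
        else:
            batch.append(w)
            if len(batch) == num_affiliations:
                flush()
    flush()
    return steps


if __name__ == "__main__":
    demo = [Workload("s1", False), Workload("s2", False),
            Workload("d1", True), Workload("s3", False)]
    for step in schedule(demo, num_affiliations=2):
        print(step.mode, step.assignment)
```

In this model a deep workload spans all affiliations, whereas each shallow workload is confined to one affiliation and, under the placeholder value above, would see `swift_pipelines_per_affiliation()` swift pipelines there.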
