EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform

Yi Huang, Xinsheng Gong, Xiangyu Kong, Dibei Chen, Jianfeng Zhu, Wenping Zhu, Liangwei Li, Mingyu Gao, Shaojun Wei, Aoyang Zhang, Leibo Liu

TL;DR

EFFACT addresses the high data movement and compute demands of fully homomorphic encryption by delivering a full-stack acceleration platform that integrates a generalized ISA, compiler, and vector-friendly hardware. It analyzes real-world FHE workloads to re-balance compute across residue-polynomial operations, eliminates costly BConv bottlenecks, and introduces streaming memory access along with circuit-level resource reuse. The platform demonstrates substantial gains over state-of-the-art accelerators on CKKS bootstrapping, HELR, and ResNet tasks, with ASIC and FPGA implementations showing improved performance density and power efficiency while using far less on-chip SRAM. By supporting CKKS, BGV, and BFV via a unified software stack, EFFACT offers a practical path to efficient, scalable FHE acceleration in both ASIC and FPGA domains.

Abstract

Fully Homomorphic Encryption (FHE) is a set of powerful cryptographic schemes that allow computation of unlimited depth to be performed directly on encrypted data. Despite FHE's promise for privacy-preserving computing, two factors degrade the performance of FHE-based computation: in most FHE schemes the ciphertext is thousands of times larger than the original message, and bootstrapping and privacy-preserving machine learning applications (such as HELR and ResNet-20) load a massive amount of data from off-chip memory. Several hardware designs have been proposed to address this issue; however, most of them require enormous resources and power. An acceleration platform with easy programmability, high efficiency, and low overhead is a prerequisite for practical application. This paper proposes EFFACT, a highly efficient full-stack FHE acceleration platform with a compiler that provides comprehensive optimizations and vector-friendly hardware. We start by examining the computational overhead across different real-world benchmarks to highlight the potential benefits of reallocating computing resources for efficiency enhancement. We then perform a design space exploration to find an optimal SRAM size with high utilization and low cost. EFFACT also features a novel optimization, streaming memory access, which enables high throughput with limited SRAM. On the software side, we propose a circuit-level function unit reuse scheme that substantially reduces computing resources without performance degradation. Moreover, we design novel NTT and automorphism units with low area that suit a cost-sensitive and highly efficient architecture. For generality, EFFACT is also equipped with an ISA and a compiler backend that can support several FHE schemes such as CKKS, BGV, and BFV.

Paper Structure

This paper contains 36 sections, 5 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: (a) The limb-wise and coefficient-wise data in a polynomial on $R_Q$ with the RNS system. (b) The level of HE operations.
  • Figure 2: A toy example of performing key-switching in HMULT. $d_0$, $d_1$, and $d_2$ are the intermediate multiplication results noted in the BConv section, and $evk_0$ and $evk_2$ are the Q-base and P-base representations of $evk_a$, where $evk_a$ is a component of the evaluation key evk=$(evk_a,evk_b)$. We only show the timing diagram of the lower branch of the data flow graph (DFG). (a) Example key-switching DFG. (b) The timing graph of architectures with enormous SRAM and buffers that can hold all temporary operands. (c) The timing graph of MAD with limited SRAM and buffers that can only hold 4 operands. (d) The timing graph of MAD with our streaming optimization, in which we keep $d_{2ntt}$ on chip for reuse in the upper branch of the DFG, reducing extra spills. The latency of a streaming-optimized instruction is determined by the longest latency among the merged instructions; here, that is the latency of loading $d_{1}$.
  • Figure 3: Residue-polynomial-level instruction counts in DBLookup, ResNet20, HELR, and Bootstrapping. BC_MULT and BC_ADD are the MULT and ADD instructions used in BConv, while MULT and ADD denote all other MULT and ADD instructions.
  • Figure 4: Impact of different SRAM sizes on utilization and total run time for a given set of computing resources. Automorphism utilization is not shown since it is always low. (a) NTT and MULT/ADD unit utilization with different on-chip memory sizes. (b) DRAM bandwidth utilization and total running time with different on-chip memory sizes.
  • Figure 5: EFFACT overall hardware architecture.
  • ...and 6 more figures
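
The latency rule stated in the Figure 2 caption (a streaming-optimized instruction takes as long as the slowest of the instructions merged into it) can be illustrated with a minimal sketch. The instruction names and cycle counts below are hypothetical values chosen for illustration, not figures from the paper, and the back-to-back sum is only an assumed baseline for comparison.

```python
def merged_latency(latencies):
    """Latency of a streaming-merged instruction: the longest of its parts."""
    latencies = list(latencies)
    if not latencies:
        raise ValueError("need at least one instruction latency")
    return max(latencies)


def serial_latency(latencies):
    """Assumed baseline: the same instructions issued back to back."""
    return sum(latencies)


# Hypothetical cycle counts (not from the paper): an off-chip load of d_1
# merged with the NTT and MULT instructions that consume it.
example = {"load_d1": 40, "ntt": 25, "mult": 10}

merged = merged_latency(example.values())
serial = serial_latency(example.values())
print(f"merged: {merged} cycles, back-to-back: {serial} cycles")
```

With these assumed numbers the load of $d_1$ dominates, which mirrors the case the caption describes.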