Taiyi: A high-performance CKKS accelerator for Practical Fully Homomorphic Encryption
Shengyu Fan, Xianglong Deng, Zhuoyu Tian, Zhicheng Hu, Liang Chang, Rui Hou, Dan Meng, Mingzhe Zhang
TL;DR
This work targets practical CKKS FHE by addressing the shift in bottlenecks caused by the KLSS method, which moves the dominant workload from NTT to inner-product (IP) operations. It introduces Taiyi, an ASIC accelerator with dedicated inner-product hardware (HP-IP), a multi-step NTT architecture, and compiler-assisted dynamic selection of the KeySwitch parameter α to adapt to the multiplicative level. Key contributions include dataflow optimizations that reduce BConv and PtMatVecMult overhead, a memory-conscious E-Key Buffer design, and a static compiler that tunes α across levels, achieving an average performance improvement of about 1.5x and a 15.7% area reduction on typical CKKS workloads. The results demonstrate substantial gains in performance per area and energy efficiency, bringing KLSS-based FHE closer to real-world deployment and highlighting the importance of co-design between cryptographic algorithms and accelerator architectures.
Abstract
Fully Homomorphic Encryption (FHE), a novel cryptographic theory enabling computation directly on ciphertext data, offers significant security benefits but is hampered by substantial performance overhead. In recent years, a series of accelerator designs have significantly enhanced the performance of FHE applications, bringing them closer to real-world applicability. However, these accelerators require large on-chip memory and chip area. Additionally, FHE algorithms are developing rapidly, leaving previous accelerator designs poorly matched to the newest optimized FHE applications. In this paper, we conduct a detailed analysis of existing applications under the new FHE method and make two key observations: 1) the bottleneck of FHE applications shifts from NTT to the inner-product operation, and 2) the optimal α of KeySwitch changes as the multiplicative level decreases. Based on these observations, we design an accelerator named Taiyi, which includes dedicated hardware for the inner-product operation and optimizes the NTT and BConv operations through algorithmic derivation. A comparative evaluation against previous state-of-the-art designs shows that Taiyi improves performance by 1.5x on average and reduces area overhead by 15.7%.
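
The second observation, that the best KeySwitch α depends on the current multiplicative level, can be illustrated with a minimal sketch. This is not Taiyi's compiler or its cost model: `toy_cost` and `select_alpha` are hypothetical names, and the cost only counts limb-wise multiplications in a textbook hybrid key-switching inner product, where a level-ℓ ciphertext is split into ceil((ℓ+1)/α) digits and each digit spans ℓ+1+α limbs. A real selection would also need to account for NTT, BConv, and evaluation-key memory traffic.

```python
from math import ceil
from typing import Callable, Dict, Sequence


def num_digits(level: int, alpha: int) -> int:
    """Number of decomposition digits at a given level: ceil((level + 1) / alpha)."""
    return ceil((level + 1) / alpha)


def toy_cost(level: int, alpha: int) -> int:
    """Hypothetical cost (an assumption, not the paper's model):
    digits times limbs per digit in the extended basis."""
    limbs = level + 1 + alpha  # ciphertext limbs plus alpha special limbs
    return num_digits(level, alpha) * limbs


def select_alpha(
    max_level: int,
    candidates: Sequence[int],
    cost: Callable[[int, int], int] = toy_cost,
) -> Dict[int, int]:
    """Pick, for every level, the candidate alpha with the lowest cost.
    Ties are broken in favour of the smaller alpha."""
    return {
        level: min(candidates, key=lambda a: (cost(level, a), a))
        for level in range(max_level + 1)
    }


if __name__ == "__main__":
    table = select_alpha(max_level=7, candidates=(1, 2, 4, 8))
    for level in sorted(table, reverse=True):
        print(f"level {level}: alpha = {table[level]}")
```

Even under this simplified cost, the selected α shrinks as the level drops, which is the kind of per-level table a static compiler could precompute and hand to the hardware.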
