Taiyi: A high-performance CKKS accelerator for Practical Fully Homomorphic Encryption

Shengyu Fan; Xianglong Deng; Zhuoyu Tian; Zhicheng Hu; Liang Chang; Rui Hou; Dan Meng; Mingzhe Zhang

TL;DR

This work targets practical CKKS-based FHE by addressing how the KLSS KeySwitch method shifts the performance bottleneck from NTT to the inner-product (IP) operation. It introduces Taiyi, an ASIC accelerator with dedicated IP hardware (HP-IP), a multi-step NTT architecture, and compiler-assisted dynamic selection of the KeySwitch parameter α to adapt to the multiplicative level. Key contributions include dataflow optimizations that reduce BConv and PtMatVecMult overhead, a memory-conscious E-Key Buffer design, and a static compiler that tunes α across levels. On typical CKKS workloads, Taiyi achieves an average performance improvement of about 1.5x and a 15.7% area reduction relative to previous state-of-the-art designs. The results demonstrate substantial gains in performance per area and energy efficiency, bringing KLSS-based FHE closer to real-world deployment and highlighting the importance of co-design between cryptographic algorithms and accelerator architectures.

Abstract

Fully Homomorphic Encryption (FHE), a cryptographic technique that enables computation directly on ciphertext data, offers significant security benefits but is hampered by substantial performance overhead. In recent years, a series of accelerator designs have significantly enhanced the performance of FHE applications, bringing them closer to real-world applicability. However, these accelerators face challenges related to large on-chip memory and area. Additionally, FHE algorithms are developing rapidly, leaving previous accelerator designs less well suited to the evolving landscape of optimized FHE applications. In this paper, we conduct a detailed analysis of existing applications with the new FHE method and make two key observations: 1) the bottleneck of FHE applications shifts from NTT to the inner-product operation, and 2) the optimal α of KeySwitch changes as the multiplicative level decreases. Based on these observations, we design an accelerator named Taiyi, which includes specific hardware for the inner-product operation and optimizes the NTT and BConv operations through algorithmic derivation. A comparative evaluation against previous state-of-the-art designs shows that Taiyi improves performance by 1.5x on average and reduces area overhead by 15.7%.
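
To make the inner-product operation named in the abstract concrete, the following is a minimal sketch of a modular multiply-accumulate over decomposed digits. It assumes each digit and each evaluation-key polynomial is already in the NTT domain and reduced modulo a single toy prime `q`, so the product is element-wise. The function name `inner_product_mod` and the data layout are hypothetical and chosen for illustration only; this is not Taiyi's HP-IP datapath and it does not reproduce the KLSS parameterization.

```python
from typing import List


def inner_product_mod(digits: List[List[int]], keys: List[List[int]], q: int) -> List[int]:
    """Return acc[i] = sum_j digits[j][i] * keys[j][i] (mod q) for every coefficient i.

    Hypothetical helper: `digits` stands in for the decomposed ciphertext digits and
    `keys` for the matching evaluation-key polynomials, all of equal length.
    """
    if len(digits) != len(keys) or not digits:
        raise ValueError("digits and keys must be non-empty and have the same count")
    n = len(digits[0])
    acc = [0] * n
    for d, k in zip(digits, keys):
        if len(d) != n or len(k) != n:
            raise ValueError("all polynomials must have the same length")
        for i in range(n):
            # One modular multiplication and one modular addition per coefficient.
            acc[i] = (acc[i] + d[i] * k[i]) % q
    return acc


# Toy usage: two digits of four coefficients each, modulus 17 (an assumption).
result = inner_product_mod([[1, 2, 3, 4], [5, 6, 7, 8]],
                           [[2, 2, 2, 2], [3, 3, 3, 3]], 17)
print(result)  # [0, 5, 10, 15]
```

In this sketch the work grows with the number of digits times the polynomial length, which is one way to see why the inner product can dominate once the digit count is large.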

Paper Structure (34 sections, 10 equations, 15 figures, 7 tables, 2 algorithms)

Introduction
Background
CKKS: A Practical and Promising FHE Scheme
CKKS parameters
KLSS: A Breakthrough KeySwitch Method
Four-Step NTT algorithm
Overview of Previous FHE Accelerators
Motivation
Contrasting Computation Workload and Memory Requirements
Performance Analysis and Optimal CKKS Parameter Selection
Real-world application breakdown analysis
Diverging the best $\alpha$ for different multiplicative level
Opportunities
Design
Multi-step (I)NTT Architecture
...and 19 more sections

Figures (15)

Figure 1: Computational arithmetic multiplication number breakdown. Results are measured for all possible values of $d_\text{num}$ and N.
Figure 2: The On-Chip Memory Requirement for the KLSS-based KeySwitch method for IP operation.
Figure 3: KeySwitch Execution Time Breakdown in FHE Application Using the KLSS-based method. The KeySwitch execution time constitutes 73.7%, 85.5%, and 84.9% of the total execution time for the respective cases.
Figure 4: The relationship between the number of ModMul operations and the decrease of multiplicative level ($l$). $\alpha$ is the number of single digits.
Figure 5: Organization of the NTTU and the data-map for input polynomial. Each NTTU serves 256 elements in one cycle, and Taiyi contains 16 lanes and each lane processes 16 elements on one cluster. For simplicity, here we take M=4 for each NTTU group as an example.
...and 10 more figures
