Trinity: A General Purpose FHE Accelerator

Xianglong Deng, Shengyu Fan, Zhicheng Hu, Zhuoyu Tian, Zihao Yang, Jiangrui Yu, Dingyuan Cao, Dan Meng, Rui Hou, Meng Li, Qian Lou, Mingzhe Zhang

TL;DR

This paper presents the first multi-modal FHE accelerator based on a unified architecture. It efficiently supports CKKS, TFHE, and their conversion scheme within a single accelerator, and achieves a 919.3x performance improvement for the FHE-conversion scheme over a CPU-based implementation.

Abstract

In this paper, we present the first multi-modal FHE accelerator based on a unified architecture, which efficiently supports CKKS, TFHE, and their conversion scheme within a single accelerator. To achieve this goal, we first analyze the theoretical foundations of these schemes and show that they are composed of a finite number of arithmetic kernels. Then, we investigate the challenges of efficiently supporting these kernels within a unified architecture, which include 1) concurrent support for NTT and FFT, 2) maintaining high hardware utilization across various polynomial lengths, and 3) ensuring consistent performance across diverse arithmetic kernels. To tackle these challenges, we propose a novel FHE accelerator named Trinity, which incorporates algorithm optimizations, hardware component reuse, and dynamic workload scheduling to accelerate CKKS, TFHE, and their conversion scheme. By adaptively selecting the allocation of components for NTT and MAC, Trinity maintains high utilization across NTTs with various polynomial lengths and across imbalanced arithmetic workloads. Experimental results show that, for pure CKKS and TFHE workloads, Trinity outperforms the state-of-the-art accelerators for CKKS (SHARP) and TFHE (Morphling) by 1.49x and 4.23x, respectively. Moreover, Trinity achieves a 919.3x performance improvement for the FHE-conversion scheme over a CPU-based implementation. Notably, despite the performance improvement, the hardware overhead of Trinity is only 85% of the summed circuit areas of SHARP and Morphling.
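The first challenge named in the abstract, concurrent support for NTT and FFT, can be illustrated with a minimal sketch. It is not Trinity's design. It only shows that a radix-2 NTT and a radix-2 FFT follow the same butterfly dataflow and differ in the arithmetic (modular integers versus complex numbers). The function names, the toy modulus `Q = 17`, and the root `OMEGA = 2` are assumptions chosen for illustration.

```python
import cmath

def radix2_transform(a, root, one, mul, add, sub):
    """Recursive radix-2 Cooley-Tukey transform over an arbitrary ring.

    `root` must be a primitive len(a)-th root of unity, and len(a)
    must be a power of two.
    """
    n = len(a)
    if n == 1:
        return list(a)
    root_sq = mul(root, root)
    even = radix2_transform(a[0::2], root_sq, one, mul, add, sub)
    odd = radix2_transform(a[1::2], root_sq, one, mul, add, sub)
    out = [None] * n
    twiddle = one
    for k in range(n // 2):
        t = mul(twiddle, odd[k])            # butterfly: one multiply,
        out[k] = add(even[k], t)            # one add,
        out[k + n // 2] = sub(even[k], t)   # one subtract
        twiddle = mul(twiddle, root)
    return out

Q = 17      # toy prime modulus (assumption), Q = 1 mod 8
OMEGA = 2   # primitive 8th root of unity mod 17, since 2**4 = 16 = -1 mod 17

def ntt(a):
    """8-point NTT over Z_17 using modular arithmetic."""
    assert len(a) == 8
    return radix2_transform(
        a, OMEGA, 1,
        lambda x, y: x * y % Q,
        lambda x, y: (x + y) % Q,
        lambda x, y: (x - y) % Q,
    )

def fft(a):
    """Power-of-two-length FFT using complex arithmetic."""
    n = len(a)
    root = cmath.exp(-2j * cmath.pi / n)
    return radix2_transform(
        [complex(x) for x in a], root, 1 + 0j,
        lambda x, y: x * y,
        lambda x, y: x + y,
        lambda x, y: x - y,
    )
```

Both `ntt` and `fft` call the same `radix2_transform`; only the multiply, add, and subtract callbacks change. This is the kind of structural overlap that hardware component reuse could plausibly exploit.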

Paper Structure

This paper contains 40 sections, 16 figures, 12 tables, 5 algorithms.

Figures (16)

  • Figure 1: Utilization of F1-like NTT and FAB-like NTT when computing NTTs of varying lengths. For a fair analysis, both the F1-like NTT and the FAB-like NTT are configured with a comparable number of modular multipliers. The F1-like NTT includes eight stages of butterfly units and processes 256 elements in parallel per cycle. In contrast, the FAB-like NTT consists of a single butterfly stage capable of processing 2048 elements in parallel per cycle. Both NTTs employ radix-2 NTT and support four-step NTT. The utilization rate is computed with a single butterfly stage as the finest granularity.
  • Figure 2: Breakdown of the computational workload of NTT and MAC operations in CKKS KeySwitch ($L$ = 23, dnum = 3) and TFHE PBS.
  • Figure 3: Overall Architecture of Trinity. NTTU denotes the NTT unit. TP denotes transpose unit. CU-$x$ denotes a configurable unit with $x$-column PEs.
  • Figure 4: The structure of NTTU. We denote the number of rows of the BU array as $M$. In the default configuration of Trinity, we set $M$ as 128, and NTTU processes 256 elements each cycle. For simplicity, here we take $M$ = 4 as an example.
  • Figure 5: CU-$x$ architecture
  • ...and 11 more figures
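
The utilization comparison described in the Figure 1 caption can be approximated with a toy model. It is a sketch under simplifying assumptions, not the paper's methodology, and its outputs are not the values plotted in Figure 1. It assumes that an $N$-point radix-2 NTT needs $\log_2 N$ butterfly stages, that a multi-stage pipeline is occupied for whole passes of 8 stages, and that a single wide stage idles the lanes beyond $N$ when $N$ is smaller than its width. Both function names are hypothetical.

```python
import math

def pipelined_utilization(n, stages=8):
    """Fraction of butterfly stages doing useful work in an F1-like pipeline.

    Toy assumption: the pipeline is occupied in whole passes of `stages`
    stages, and an n-point radix-2 NTT needs log2(n) stages in total.
    """
    needed = n.bit_length() - 1          # log2(n) for a power-of-two n
    passes = math.ceil(needed / stages)  # whole passes through the pipeline
    return needed / (passes * stages)

def single_stage_utilization(n, lanes=2048):
    """Fraction of lanes doing useful work in a FAB-like single wide stage.

    Toy assumption: an NTT shorter than the stage width leaves the
    remaining lanes idle, and longer NTTs fill the stage completely.
    """
    return min(n, lanes) / lanes

if __name__ == "__main__":
    for log_n in (10, 12, 14, 16):
        n = 1 << log_n
        print(log_n, pipelined_utilization(n), single_stage_utilization(n))
```

Under these assumptions the multi-stage pipeline loses efficiency whenever $\log_2 N$ is not a multiple of 8, while the single wide stage loses efficiency only for short polynomials. This is one way to read the motivation for allocating components adaptively.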