Leveraging ASIC AI Chips for Homomorphic Encryption

Jianming Tong, Tianhao Huang, Jingtian Dang, Leo de Castro, Anirudh Itagi, Anupam Golder, Asra Ali, Jeremy Kun, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna

TL;DR

This paper presents CROSS, a compiler-based framework that enables efficient execution of Homomorphic Encryption (HE) workloads on AI accelerators like Google TPUs by recasting high-precision modular arithmetic into dense low-precision matrix multiplications and embedding data layout changes into computation. The two main techniques, Basis-Aligned Transformation (BAT) and Memory-Aligned Transformation (MAT), convert 32-bit arithmetic into 8-bit MatMul and fuse reordering (transpose/bit-reverse) into offline parameter transformations, respectively. On TPUv6e, CROSS achieves state-of-the-art throughput and energy efficiency for key HE operators (NTT, ModMul, BConv) and HE ML workloads (MNIST, LR), outperforming SoTA CPU, GPU, FPGA, and some HE ASICs, though a gap remains to dedicated FHE ASICs due to fixed moduli, specialized shuffling hardware, and memory footprint. Overall, CROSS demonstrates that AI accelerators can serve as a practical, energy-efficient platform for privacy-preserving computation with compiler-driven optimization, and points to future work to narrow remaining gaps and extend to end-to-end privacy-preserving pipelines.
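
The idea of fusing a reordering into offline parameters can be illustrated with a toy example. The sketch below is not the CROSS implementation: it assumes a naive matrix-form NTT of size 8 over the small modulus 17 with root of unity 2 (all hypothetical choices made for illustration), and shows that permuting the rows of the twiddle matrix ahead of time yields the bit-reversed output with no runtime shuffle.

```python
# Illustrative sketch only; names and parameters are assumptions, not from CROSS.

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def ntt_matrix(n, w, q):
    """Naive n x n twiddle matrix with entry [i][j] = w^(i*j) mod q."""
    return [[pow(w, i * j, q) for j in range(n)] for i in range(n)]

def matvec_mod(m, x, q):
    """Matrix-vector product with every output reduced mod q."""
    return [sum(a * b for a, b in zip(row, x)) % q for row in m]

N, Q, W, BITS = 8, 17, 2, 3   # 2 has multiplicative order 8 modulo 17
perm = [bit_reverse(i, BITS) for i in range(N)]

base = ntt_matrix(N, W, Q)
# Offline step: bake the bit-reversal into the parameters by permuting rows.
fused = [base[perm[i]] for i in range(N)]

x = [1, 2, 3, 4, 5, 6, 7, 8]

# Baseline: transform first, then shuffle the result at runtime.
y = matvec_mod(base, x, Q)
runtime_shuffle = [y[perm[i]] for i in range(N)]

# Fused: a single product already produces the reordered output.
fused_out = matvec_mod(fused, x, Q)
assert fused_out == runtime_shuffle
```

The permutation costs nothing at runtime here because it is applied once to a constant matrix. Whether MAT handles transposes and multi-stage NTTs the same way is not stated in this summary.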

Abstract

Homomorphic Encryption (HE) provides strong data privacy for cloud services, but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, there remains an order-of-magnitude energy-efficiency gap compared to specialized (but expensive) HE ASICs. This paper explores an alternate direction: leveraging existing AI accelerators, like Google's TPUs with coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with SoTA HE algorithms designed for GPUs. These algorithms rely heavily on: (1) high-precision (32-bit) integer arithmetic, which on a TPU must run on the low-throughput vector unit, leaving the high-throughput low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations that are inefficient on the TPU's coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle these challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU's architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking and improving the utilization of the TPU's MXU for HE, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding reordering into compute kernels through offline parameter transformation. On TPU v6e, CROSS achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing AI ASICs as the SoTA efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS
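
To make the idea of recasting 32-bit arithmetic as 8-bit matrix multiplication concrete, here is a minimal sketch. It is not the paper's BAT algorithm: it assumes a multiplication by a constant known ahead of time, splits each 32-bit operand into four 8-bit limbs, and uses a precomputed Toeplitz matrix so that one dense product of 8-bit entries yields all partial sums. The function names and the final reduction with ordinary Python integers are assumptions made for illustration.

```python
# Illustrative sketch only; not the CROSS/BAT implementation.
LIMB_BITS = 8
NUM_LIMBS = 4  # four 8-bit limbs cover one 32-bit word

def to_limbs(x):
    """Split an unsigned 32-bit integer into 4 little-endian 8-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & 0xFF for i in range(NUM_LIMBS)]

def constant_to_matrix(c):
    """Offline step: build a 4 x 7 Toeplitz matrix of 8-bit entries from c.

    Entry [i][k] holds limb (k - i) of c, so a row of limbs times this
    matrix gives the 7 un-carried partial sums of the schoolbook product.
    """
    c_limbs = to_limbs(c)
    cols = 2 * NUM_LIMBS - 1
    return [[c_limbs[k - i] if 0 <= k - i < NUM_LIMBS else 0
             for k in range(cols)]
            for i in range(NUM_LIMBS)]

def matmul(a, b):
    """Plain dense matrix multiply, standing in for a low-precision matrix engine."""
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]

def mul_by_constant_mod(values, c, q):
    """Compute (v * c) mod q for each 32-bit v via one matmul over 8-bit entries."""
    lhs = [to_limbs(v) for v in values]   # N x 4, every entry < 2^8
    rhs = constant_to_matrix(c)           # 4 x 7, every entry < 2^8
    partial = matmul(lhs, rhs)            # N x 7, every entry <= 4 * 255^2
    out = []
    for row in partial:
        # Recombine partial sums with their limb weights, then reduce.
        wide = sum(p << (LIMB_BITS * k) for k, p in enumerate(row))
        out.append(wide % q)
    return out
```

Each accumulated partial sum is at most 4 * 255^2 = 260100, so it fits comfortably in a 32-bit accumulator. The sketch leaves the modular reduction as a separate scalar step; how CROSS folds reduction into the transformation is described in the paper itself.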

Paper Structure

This paper contains 68 sections, 8 equations, 16 figures, 10 tables, 5 algorithms.

Figures (16)

  • Figure 1: CROSS computes directly on encrypted data to enable privacy-preserving model serving on AI ASICs.
  • Figure 2: The TPU's compute and memory granularity is coarser than the GPU's.
  • Figure 3: CROSS aims at (1) eliminating compute redundancy, (2) leveraging powerful MXU for throughput improvement, and (3) removing explicit memory costs for better efficiency.
  • Figure 4: Overview of the TPUv4 architecture based on public information [tpuv4i, TPUv2, jouppi2023tpu, tpu_web_doc]. The black and gray boxes represent the two tensor cores, respectively. The two tensor cores share the same 128 MB common memory (CMEM, removed in newer TPUs) to hold frequently used data. Each tensor core has 4 matrix multiplication units (MXUs) and 2048 ALUs in the Vector Processing Unit (VPU), organized as 128 SIMD lanes. Each lane consists of 8 SIMD sublanes with 128 KB of vector memory (VMEM); each sublane has 2 dual-issue ALUs and a 128 B local register file. These two levels of SIMD force a group of (8, 128) 32-bit registers, termed a VReg, to operate in lockstep. Each MXU features a $128{\times}128$ systolic array ($256{\times}256$ for TPU v6 and later) for performing matrix multiplication. Each MXU has a local transpose unit that can optionally transpose the right-hand-side (RHS) input matrix in a pipelined manner, hiding the transpose latency. Data in the VMEM of different lanes can be transposed, shuffled, or accumulated through the Cross Lane Unit (XLU), which incurs layout-reordering and reduction latency that is not hidden.
  • Figure 5: AI ASICs deliver better energy efficiency among practical devices using the same technology nodes.
  • ...and 11 more figures