Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs

Zhiwei Wang, Haoqi He, Lutan Zhao, Peinan Li, Zhihao Li, Dan Meng, Rui Hou

TL;DR

This paper proposes a scalable NTT acceleration design that adapts to both the larger CKKS polynomials and the smaller TFHE polynomials. It also introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency.

Abstract

Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this paper, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce the number of synchronizations, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms state-of-the-art GPU implementations by 1.23x in CKKS HMUL and 1.15x in bootstrapping. It also achieves up to 4.87x and 1.51x speedups for TFHE gate bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3x average speedup for scheme switching over a CPU-based implementation.
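To make the "fusing stages to reduce synchronization" idea concrete, the following is a minimal CPU-side sketch in Python, not Chameleon's GPU kernel. It assumes a plain cyclic radix-2 NTT over the toy prime 257 (FHE libraries typically use a negacyclic variant with much larger moduli), and the function names `ntt_plain` and `ntt_fused` are hypothetical. Each iteration of the inner loop stands in for the work of one GPU thread, and the `syncs` counter marks the points where a barrier would be needed between passes. Coefficient shuffling and the SM-aware combination strategy are not modeled.

```python
P = 257  # toy NTT-friendly prime (P - 1 = 2**8); an assumption for illustration
G = 3    # a primitive root modulo 257


def root_of_unity(n):
    """Primitive n-th root of unity mod P (n must be a power of two <= 256)."""
    return pow(G, (P - 1) // n, P)


def bit_reverse_copy(a):
    """Return a copy of a in bit-reversed index order, reduced mod P."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i, x in enumerate(a):
        r = 0
        for b in range(bits):
            if (i >> b) & 1:
                r |= 1 << (bits - 1 - b)
        out[r] = x % P
    return out


def butterfly(a, lo, half, omega):
    """One decimation-in-time butterfly on the pair (lo, lo + half)."""
    n = len(a)
    w = pow(omega, (lo % (2 * half)) * (n // (2 * half)), P)
    u, v = a[lo], a[lo + half] * w % P
    a[lo], a[lo + half] = (u + v) % P, (u - v) % P


def ntt_plain(coeffs):
    """One stage per pass: a barrier is needed after every stage but the last."""
    a, n = bit_reverse_copy(coeffs), len(coeffs)
    omega, half, syncs = root_of_unity(n), 1, 0
    while half < n:
        for lo in range(n):
            if lo % (2 * half) < half:
                butterfly(a, lo, half, omega)
        half *= 2
        if half < n:
            syncs += 1
    return a, syncs


def ntt_fused(coeffs):
    """Two stages per pass: each 'thread' owns 4 elements and needs no barrier inside."""
    a, n = bit_reverse_copy(coeffs), len(coeffs)
    omega, half, syncs = root_of_unity(n), 1, 0
    while half < n:
        if 2 * half < n:  # at least two stages remain, so fuse them
            for lo in range(n):
                if lo % (4 * half) < half:
                    butterfly(a, lo, half, omega)             # first fused stage
                    butterfly(a, lo + 2 * half, half, omega)
                    butterfly(a, lo, 2 * half, omega)         # second fused stage
                    butterfly(a, lo + half, 2 * half, omega)
            half *= 4
        else:  # a single leftover stage when log2(n) is odd
            for lo in range(n):
                if lo % (2 * half) < half:
                    butterfly(a, lo, half, omega)
            half *= 2
        if half < n:
            syncs += 1
    return a, syncs


def ntt_naive(coeffs):
    """O(n^2) reference transform used only to check the fast versions."""
    n, omega = len(coeffs), root_of_unity(len(coeffs))
    return [sum(c * pow(omega, j * k, P) for j, c in enumerate(coeffs)) % P
            for k in range(n)]
```

In this sketch, both variants produce the same transform, but for a length-16 input the fused version crosses one barrier point instead of three. Fusing more than two stages per pass would reduce the count further at the cost of more elements held per thread.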

Paper Structure (24 sections, 1 equation, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Working flow of Chameleon.
  • Figure 2: Performance breakdown of scheme switching between CKKS and TFHE in Pegasus ($n_{\rm ckks}=2^{16}$, $n_{\rm lwe}=2^{10}$, $n_{\rm lut}=2^{12}$, ciphertext modulus = 599 bits).
  • Figure 3: LUT acceleration with two levels of parallelism.
  • Figure 4: Repack acceleration with homomorphic rotation-free MatVec optimization.
  • Figure 5: Butterfly decomposition principle. The calculation of the decomposed same-color area is completed by the same thread without synchronization.
  • ...and 9 more figures