Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs

Zhiwei Wang; Haoqi He; Lutan Zhao; Peinan Li; Zhihao Li; Dan Meng; Rui Hou

TL;DR

This paper proposes a scalable NTT acceleration design that adapts to both the larger CKKS polynomials and the smaller TFHE polynomials. It also introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency.

Abstract

Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this paper, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce synchronization, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms the state-of-the-art GPU implementations by 1.23x in CKKS HMUL and 1.15x in bootstrapping. It also achieves up to 4.87x and 1.51x speedups for TFHE gate bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3x average speedup for scheme switching over CPU-based implementation.
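To make the synchronization argument concrete, here is a minimal CPU-side sketch of an iterative radix-2 NTT. It is not Chameleon's GPU kernel. The function names (`ntt_stages`, `naive_ntt`, `sync_groups`), the toy modulus 17, and the root of unity 2 are assumptions chosen for illustration only. The sketch shows that a length-n transform runs log2(n) butterfly stages, each of which depends on the previous one, so a naive parallel version would need a barrier between stages. The helper `sync_groups` counts barriers under the simplifying assumption that fusing k consecutive stages leaves one barrier per fused group.

```python
def ntt_stages(a, omega, q):
    """Iterative radix-2 Cooley-Tukey NTT modulo q.

    Returns (transformed list, number of butterfly stages).
    Assumes len(a) is a power of two and omega is a primitive
    len(a)-th root of unity modulo q.
    """
    n = len(a)
    a = list(a)
    # Bit-reversal permutation so that butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    stages = 0
    length = 2
    while length <= n:
        # Each pass of this loop is one stage; on a GPU, all butterflies
        # of a stage are independent, but the next stage must wait for them.
        w_len = pow(omega, n // length, q)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q
        stages += 1
        length <<= 1
    return a, stages


def naive_ntt(a, omega, q):
    """Quadratic-time reference: X[k] = sum_i a[i] * omega^(i*k) mod q."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * k, q) for i in range(n)) % q
            for k in range(n)]


def sync_groups(log_n, fused):
    """Illustrative barrier count: one barrier per group of `fused` stages."""
    return -(-log_n // fused)  # ceiling division


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
    out, stages = ntt_stages(coeffs, 2, 17)  # 2 has order 8 modulo 17
    print(out == naive_ntt(coeffs, 2, 17), stages)
    print(sync_groups(16, 1), sync_groups(16, 4))
```

Under this simplified model, a CKKS-sized polynomial with 2^16 coefficients has 16 stages, so grouping stages reduces how often threads must wait on one another. The actual fusion depth and the SM-aware switching point are design decisions of the paper that this sketch does not reproduce.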

Paper Structure (24 sections, 1 equation, 14 figures, 7 tables)

Introduction
Background
CKKS Scheme
TFHE Scheme
Scheme Switching Algorithm
Number Theoretic Transform
GPU Overview
Design Overview
Critical Switching Operations Acceleration on GPUs
LUT Acceleration with CMux Gate-level Parallelism
Repack Acceleration with Homomorphic Rotation-free MatVec Optimization
Scalable NTT Acceleration Design on GPUs
Butterfly Decomposition-based NTT for TFHE Polynomial
Thread Aggregation-based NTT for CKKS Polynomial
Polynomial Coefficient Shuffling-based NTT
...and 9 more sections

Figures (14)

Figure 1: Working flow of Chameleon.
Figure 2: Performance breakdown of scheme switching between CKKS and TFHE in Pegasus ($n_{\rm ckks}=2^{16}$, $n_{\rm lwe}=2^{10}$, $n_{\rm lut}=2^{12}$, ciphertext modulus = 599 bits).
Figure 3: LUT acceleration with two levels of parallelism.
Figure 4: Repack acceleration with homomorphic rotation-free MatVec optimization.
Figure 5: Butterfly decomposition principle. The calculation of the decomposed same-color area is completed by the same thread without synchronization.
...and 9 more figures
