CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query
Qirui Li, Rui Zong
TL;DR
CAT is an open-source, GPU-accelerated FHE framework with a three-layer architecture (core math, pre-computed elements and combined operations, and API-level operators) backed by dual GPU pools. It implements CKKS, BFV, and BGV, delivering speedups of up to 2173× over CPU implementations, and it supports CKKS-based Private Dataset Query (PDQ) tasks that handle datasets of up to 10^3 rows within 1 second using 2–5 GB of memory. A novel multiplicative-masking two-party protocol enables high-precision, bootstrapping-free encrypted "search-and-average" queries, making PDQ more practical. The work provides open-source tooling for integrating state-of-the-art FHE into real-world systems and outlines future enhancements in bootstrapping, scheme switching, and expanded query operators.
Abstract
We introduce CAT, an open-source GPU-accelerated fully homomorphic encryption (FHE) framework that surpasses existing solutions in functionality and efficiency. CAT features a three-layer architecture: a foundation of core math, a bridge of pre-computed elements and combined operations, and an API-accessible layer of FHE operators. It uses techniques such as parallel execution of operations, well-defined layout patterns for cipher data, kernel fusion/segmentation, and dual GPU pools to improve overall execution efficiency. In addition, a memory management mechanism ensures server-side suitability and prevents data leakage. Based on our framework, we implement three widely used FHE schemes: CKKS, BFV, and BGV. The results show that our implementation on an NVIDIA RTX 4090 achieves up to a 2173$\times$ speedup over a CPU implementation and up to 1.25$\times$ over state-of-the-art GPU acceleration work for specific operations. Furthermore, we validate the framework in a CKKS-based Private Dataset Query (PDQ) scenario, achieving a 33$\times$ speedup over its CPU counterpart. All query tasks can handle datasets of up to $10^3$ rows on a single GPU within 1 second, using 2-5 GB of storage. Our implementation has undergone extensive stability testing and can be easily deployed on commercial GPUs. We hope that our work will significantly advance the integration of state-of-the-art FHE algorithms into diverse real-world systems by providing a robust, industry-ready, and open-source tool.
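To give a feel for why a multiplicative mask can make a "search-and-average" query bootstrapping-free, the following is a minimal plaintext sketch in Python. It is not the protocol from the paper, whose details are not given in this section. It only illustrates one plausible reading: the server computes a selected sum and a match count using additions and multiplications alone, scales both by the same secret random factor, and the client performs the division in the clear after decryption. The function names, the equality-based selection mask, and the range of the random factor are all assumptions made for illustration; a real deployment would run the server steps on CKKS ciphertexts.

```python
import random


def masked_search_and_average(values, keys, target, rng=random):
    """Hypothetical server side, shown on plaintext stand-ins for ciphertexts.

    Returns (r * sum_of_matches, r * number_of_matches) for a secret r.
    """
    # Selection mask: 1.0 where the key matches the query, 0.0 elsewhere.
    # (Assumption: in an encrypted setting this mask would itself be
    # derived homomorphically; here it is computed directly.)
    mask = [1.0 if k == target else 0.0 for k in keys]

    # Masked sum and match count use only additions and multiplications,
    # which are the operations FHE schemes support natively.
    total = sum(m * v for m, v in zip(mask, values))
    count = sum(mask)

    # Multiplicative masking: the same nonzero random factor hides the
    # individual magnitudes of the sum and the count but cancels in the ratio.
    r = rng.uniform(1.0, 1000.0)
    return r * total, r * count


def client_recover_average(masked_total, masked_count):
    """Hypothetical client side: divide after decryption, in the clear."""
    if masked_count == 0:
        return None  # no row matched the query
    return masked_total / masked_count


if __name__ == "__main__":
    values = [10.0, 20.0, 30.0, 40.0]
    keys = ["a", "b", "a", "c"]
    masked_total, masked_count = masked_search_and_average(values, keys, "a")
    print(client_recover_average(masked_total, masked_count))
```

Because the division happens on the client, the server never needs an encrypted inverse, which is the kind of deep circuit that would otherwise push a CKKS computation toward bootstrapping. The recovered average is exact only up to floating-point rounding in this sketch, and up to CKKS approximation error in an encrypted setting.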
