
GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference

Guoci Chen, Xiurui Pan, Qiao Li, Bo Mao, Congming Gao, Chengying Huan, Mingzhe Zhang, Jie Zhang

Abstract

Deploying large language models (LLMs) as cloud services raises privacy concerns, as inference may leak sensitive data. Fully Homomorphic Encryption (FHE) allows computation on encrypted data, but current FHE methods struggle to evaluate nonlinear functions both efficiently and precisely. Specifically, CKKS-based approaches require high-degree polynomial approximations, whose cost grows rapidly as the target precision increases. Alternatively, TFHE's Programmable Bootstrapping (PBS) outperforms CKKS by offering exact lookup-table evaluation, but it lacks high-precision implementations of LLM nonlinear layers and underutilizes GPU resources. We propose TIGER, the first GPU-accelerated framework for high-precision TFHE-based evaluation of nonlinear LLM layers. TIGER offers: (1) a GPU-optimized WoP-PBS method combined with numerical algorithms to surpass the precision limits of native lookup tables on nonlinear functions; (2) high-precision, efficient implementations of key nonlinear layers, enabling practical encrypted inference; (3) a batch-driven design that exploits inter-input parallelism to boost GPU efficiency. TIGER achieves 7.17×, 16.68×, and 17.05× speedups over a CPU baseline for GELU, Softmax, and LayerNorm, respectively.
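The "lookup plus numerical refinement" idea summarized above (and shown as the lookup-plus-refinement workflow in Figure 3) can be illustrated in the clear: a coarse lookup table supplies an initial estimate, and a cheap numerical iteration lifts the result past the table's native precision. The sketch below is a minimal cleartext analogue under stated assumptions, using a reciprocal (the kind of operation Softmax and LayerNorm normalization need) refined by one Newton-Raphson step; the 6-bit table and all function names are illustrative and are not TIGER's API, whose kernels operate on TFHE ciphertexts via WoP-PBS.

```python
# Cleartext sketch of a lookup-plus-refinement step (hypothetical names;
# the actual TIGER operators run on TFHE ciphertexts through WoP-PBS).
# A small lookup table gives a coarse reciprocal estimate, and one
# Newton-Raphson iteration roughly doubles the effective precision.

LUT_BITS = 6  # coarse table indexed by the top 6 fraction bits of x in [1, 2)

# Precompute 1/x at the midpoint of each LUT bucket.
_LUT = [1.0 / (1.0 + (i + 0.5) / 2**LUT_BITS) for i in range(2**LUT_BITS)]

def coarse_lut_recip(x: float) -> float:
    """Coarse 1/x for x in [1, 2), as a table lookup (stand-in for a PBS)."""
    idx = int((x - 1.0) * 2**LUT_BITS)
    return _LUT[min(idx, 2**LUT_BITS - 1)]

def refined_recip(x: float) -> float:
    """One Newton-Raphson step: y' = y * (2 - x * y)."""
    y = coarse_lut_recip(x)
    return y * (2.0 - x * y)

if __name__ == "__main__":
    x = 1.37
    print("coarse error :", abs(coarse_lut_recip(x) - 1.0 / x))
    print("refined error:", abs(refined_recip(x) - 1.0 / x))
```

The point of the sketch is the division of labor: the table (a PBS in the encrypted setting) only needs to be accurate enough to seed the iteration, and the refinement, built from additions and multiplications, supplies the remaining precision.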


Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, 7 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Single GPT-2 Transformer block. Nonlinear components (LayerNorm, Softmax, and GELU) are highlighted, while linear transformations and residual additions are shown in gray.
  • Figure 2: Time (red) and efficiency (blue) of PBS with different batch sizes; ideal linear scaling is shown in dashed red.
  • Figure 3: Architecture overview of TIGER. TIGER targets nonlinear TFHE layers in encrypted LLM inference and organizes them into three levels: composite nonlinear layer operations, supporting operators (high-precision function evaluation and fixed-point operators), and low-level TFHE primitives. The insets show the lookup-plus-refinement workflow and the multiplication scheduler. Blue indicates core TIGER modules, green indicates partial design or optimization contributions, and gray denotes contextual components. Note that GELU is implemented as a high-precision function, but is also evaluated as a standalone nonlinear layer.
  • Figure 4: Multiply scheduler design. The schedule phase first selects only the partial products relevant to the target output range and prunes the rest. It then organizes the block-column reduction into multiple dependent passes: blocks in the same column are summed, decomposed into low parts and carries, and written back for later passes. Passes with the same structure from different inputs can further be grouped into batched execution units. The emitted schedule is then consumed by the execute phase, which launches kernels and PBS operations and produces the final output blocks (a cleartext sketch of this reduction appears after this list).
  • Figure 5: Layer-wise execution time comparison among TIGER, CPU-WoP, and GPU-FBT for GELU, Softmax, and LayerNorm (log scale).
  • ...and 2 more figures
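To make the Figure 4 reduction concrete, below is a minimal cleartext sketch of pruned partial products and multi-pass block-column reduction with carry decomposition. It is an analogue under assumptions: the 4-bit block width, block count, and all function names are illustrative, and the low-part/carry extractions stand in for the PBS operations the real scheduler would launch on ciphertext blocks.

```python
# Cleartext sketch of the Figure 4 block-column reduction (hypothetical
# structure; the real scheduler drives TFHE ciphertext blocks and PBS).
# Large integers are split into BASE-radix blocks; partial products are
# grouped by output column, summed, and repeatedly decomposed into a low
# part plus a carry until no column overflows a single block.

BASE = 16          # assumed 4-bit message blocks
NUM_BLOCKS = 4     # assumed number of blocks per operand / output

def to_blocks(x: int) -> list[int]:
    return [(x >> (4 * i)) & 0xF for i in range(NUM_BLOCKS)]

def from_blocks(blocks: list[int]) -> int:
    return sum(b << (4 * i) for i, b in enumerate(blocks))

def multiply_blocks(a: list[int], b: list[int]) -> list[int]:
    # Schedule phase: keep only partial products whose column lands in the
    # output range (pruning), grouped per output column.
    columns = [[] for _ in range(NUM_BLOCKS)]
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < NUM_BLOCKS:          # prune columns past the output
                columns[i + j].append(ai * bj)

    # Reduction passes: sum each column, keep the low block, push the carry
    # to the next column; repeat until every column fits in one block.
    while any(len(c) > 1 or (c and c[0] >= BASE) for c in columns):
        next_cols = [[] for _ in range(NUM_BLOCKS)]
        for col, parts in enumerate(columns):
            s = sum(parts)
            next_cols[col].append(s % BASE)          # low part (a PBS in TFHE)
            if col + 1 < NUM_BLOCKS and s // BASE:
                next_cols[col + 1].append(s // BASE) # carry (a PBS in TFHE)
        columns = next_cols

    return [c[0] if c else 0 for c in columns]

if __name__ == "__main__":
    a, b = 0x1A3B, 0x00F7
    result = from_blocks(multiply_blocks(to_blocks(a), to_blocks(b)))
    assert result == (a * b) % (1 << (4 * NUM_BLOCKS))
    print(hex(result))
```

In the encrypted setting each low-part and carry extraction is a PBS, so grouping structurally identical passes across different inputs, as Figure 4 describes, turns them into large PBS batches that keep the GPU occupied.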