HE-LRM: Encrypted Deep Learning Recommendation Models using Fully Homomorphic Encryption

Karthik Garimella, Austin Ebel, Gabrielle De Micheli, Brandon Reagen

TL;DR

HE-LRM advances privacy-preserving inference for Deep Learning Recommendation Models by introducing client-side digit-decomposition embedding compression and a multi-embedding diagonal packing strategy, enabling end-to-end CKKS-based DLRM inference within the Orion framework. The embedding compression achieves up to a $77\times$ speedup over prior work, the packing supports parallel lookups across multiple embedding tables, and end-to-end single-threaded CPU latencies are $24.22$ s on UCI Heart and $\sim$213–489 s on Criteo, with hardware accelerators (GPU/ASIC) projected to bring latencies to seconds or sub-seconds. The work also discusses threat models, embedding-lookup trade-offs, and detailed results on training and deployment, highlighting practical implications for privacy-preserving recommendations. Overall, HE-LRM brings encrypted DLRMs closer to real-world deployment by balancing compression, packing, and accelerator-aware execution within the CKKS-based private inference paradigm.

Abstract

Fully Homomorphic Encryption (FHE) allows for computation directly on encrypted data and enables privacy-preserving neural inference in the cloud. Prior work has focused on models with dense inputs (e.g., CNNs), with less attention given to those with sparse inputs such as Deep Learning Recommendation Models (DLRMs). These models require encrypted lookups into large embedding tables, which are challenging to implement using FHE's restrictive operators and introduce significant overhead. In this paper, we develop performance optimizations to efficiently support sparse features and neural recommendation in FHE. First, we present an embedding compression technique using client-side digit decomposition that achieves a $77\times$ speedup over the state of the art. Next, we propose a multi-embedding packing strategy that enables ciphertext SIMD-parallel lookups across multiple tables. We name our approach HE-LRM and integrate it into the open-source Orion FHE framework to demonstrate end-to-end encrypted DLRM inference. We evaluate HE-LRM on UCI (health prediction) and Criteo (click prediction), achieving inference latencies of 24 and 489 seconds, respectively, on a single-threaded CPU. Finally, we show how GPU and ASIC FHE acceleration can reduce end-to-end latencies to seconds and even sub-seconds, making encrypted recommendations near practical.
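
The client-side digit decomposition mentioned above can be illustrated in plaintext. The sketch below is not the paper's implementation: it assumes a token index is split into $\ell$ plain base-$p$ digits (so $k = p^\ell$ rows are addressable), that each digit selects one row from its own sub-table of $p$ rows, and that the final embedding is the sum of the selected rows. The function names and the random tables are hypothetical, and NumPy stands in for the encrypted matrix-vector product a server would evaluate.

```python
import numpy as np

def digits_base_p(index: int, p: int, ell: int) -> list:
    """Return the ell base-p digits of index, least significant first."""
    if not 0 <= index < p ** ell:
        raise ValueError("index must lie in [0, p**ell)")
    digits = []
    for _ in range(ell):
        index, digit = divmod(index, p)
        digits.append(digit)
    return digits

def client_one_hot(index: int, p: int, ell: int) -> np.ndarray:
    """Concatenate ell one-hot vectors of length p (p * ell slots in total)."""
    encoded = np.zeros(p * ell)
    for position, digit in enumerate(digits_base_p(index, p, ell)):
        encoded[position * p + digit] = 1.0
    return encoded

# Assumed toy sizes: k = p**ell = 16 addressable rows, embedding dimension d = 3.
p, ell, d = 4, 2, 3
rng = np.random.default_rng(0)
# ell sub-tables of p rows each, stacked into one (p * ell, d) matrix (random stand-ins).
stacked_tables = rng.normal(size=(p * ell, d))

index = 11                                  # 11 = 3 + 2 * 4, so the digits are [3, 2]
encoded = client_one_hot(index, p, ell)     # computed by the client before encryption
embedding = encoded @ stacked_tables        # plaintext stand-in for the encrypted lookup
```

Under these assumptions the client sends $p\ell$ slots instead of a one-hot vector of length $k$, and the server-side work reduces to a single linear transformation.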

Paper Structure

This paper contains 31 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: Architecture of a Deep Learning Recommendation Model (DLRM). HE-LRM utilizes Fully Homomorphic Encryption (FHE) to perform end-to-end encrypted inference of this model, maintaining data privacy.
  • Figure 2: Single-threaded CPU latencies of primitive CKKS operations averaged over 30 runs. Both ciphertext-ciphertext (CT-CT) multiplication and ciphertext (CT) rotation require a compute- and memory-intensive key-switching operation. CKKS multiplies require ciphertexts to have at least two remaining levels.
  • Figure 3: Embedding table sizes (number of rows) for each of the 26 categorical features in the Criteo dataset for a total of $33.8$ million rows.
  • Figure 4: Compressed embedding lookup from prior work (kim2024privacypreserving) with $k$ rows and an embedding dimension of size $d = 3$. The client must store the coded token mapping of size $k \times \ell$ locally. Prior work performs this compressed lookup homomorphically by utilizing $p \ell$ slots per ciphertext and performing one homomorphic $\mathsf{Indicator}$ function, followed by $d$ multiplications with the concatenated columns of the compressed tables, followed by $d \log (p\ell)$ rotations and summations (see Table 4 of kim2024privacypreserving). Rotation and summation within a ciphertext produce several wasted slots that are filled with invalid data. In this example, two separate tokens ("cat" and "dog") fit into a single 16-slot ciphertext.
  • Figure 5: Comparison of encrypted embedding lookups using a fixed set of FHE parameters for different settings of $k = p^\ell$ and embedding dimensions, $d$. CodedHELUT first uses the Encrypted Indicator Function to perform the one-hot encoding server-side. This function requires a bootstrap operation given our FHE configuration. For the embedding lookup stage, swapping out their suggested TableMult algorithm for a double-hoisted BSGS linear transformation reduces runtime, as TableMult requires separate rotations for every hidden dimension, $d$. Our proposed solution performs the one-hot encoding client-side while still requiring the same number of slots as the prior method and directly leverages double-hoisted BSGS.
  • ...and 5 more figures
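
The rotate-and-sum step described in the Figure 4 caption can be simulated in plaintext to show where the wasted slots come from. This is a minimal sketch, not code from the paper or from Orion: `np.roll` stands in for a CKKS ciphertext rotation, the function name `rotate_and_sum` is hypothetical, and the layout assumes two tokens of $p\ell = 8$ slots each sharing one 16-slot ciphertext.

```python
import numpy as np

def rotate_and_sum(slots: np.ndarray, block: int) -> np.ndarray:
    """Sum each window of `block` slots using log2(block) cyclic rotations."""
    if block < 1 or block & (block - 1):
        raise ValueError("block must be a power of two")
    acc = slots.copy()
    shift = 1
    while shift < block:
        # np.roll(acc, -shift)[i] == acc[(i + shift) % n], mimicking a left rotation.
        acc = acc + np.roll(acc, -shift)
        shift *= 2
    return acc

block = 8                           # assumed p * ell slots per token
slots = np.arange(1.0, 17.0)        # slots 0..7 hold token A, slots 8..15 hold token B
out = rotate_and_sum(slots, block)

# Only the first slot of each block holds a valid sum after 3 rotations.
valid = out[::block]
```

Here `out[0]` and `out[8]` hold the sums of the two blocks, while every other slot holds a window that straddles a block boundary or wraps around, which corresponds to the invalid data the caption mentions.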