HE-LRM: Encrypted Deep Learning Recommendation Models using Fully Homomorphic Encryption
Karthik Garimella, Austin Ebel, Gabrielle De Micheli, Brandon Reagen
TL;DR
HE-LRM advances privacy-preserving inference for Deep Learning Recommendation Models (DLRMs) by introducing client-side digit-decomposition embedding compression and a multi-embedding diagonal packing strategy, enabling end-to-end CKKS-based DLRM inference within the Orion framework. The compression technique achieves up to a $77\times$ speedup over prior work, and the packing strategy supports parallel lookups across multiple embedding tables. End-to-end CPU latencies are $24.22$ s on UCI Heart and $\sim$213–489 s on Criteo, with hardware accelerators (GPU/ASIC) projected to bring latencies down to seconds or sub-seconds. The work also discusses threat models and embedding-lookup trade-offs, and reports detailed training and deployment results, highlighting practical implications for privacy-preserving recommendation. Overall, HE-LRM brings encrypted DLRMs closer to real-world deployment by combining compression, packing, and accelerator-aware execution within the CKKS-based private inference paradigm.
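To make the packing idea concrete, the following is a plaintext NumPy sketch of how several one-hot lookups could be served by a single matrix-vector product computed from slot-wise products and rotations. It uses the well-known diagonal (Halevi-Shoup) matrix-vector method on a block-diagonal arrangement of tables. The block-diagonal layout, the zero padding, and the helper names `diagonal_matvec`, `pack_tables`, and `pack_one_hots` are assumptions made for illustration and are not taken from HE-LRM or Orion; `np.roll` merely stands in for a ciphertext rotation.

```python
import numpy as np


def diagonal_matvec(M, v):
    """Compute M @ v using only slot-wise products and cyclic rotations."""
    n = M.shape[0]
    assert M.shape == (n, n) and v.shape == (n,)
    idx = np.arange(n)
    out = np.zeros(n)
    for i in range(n):
        diag = M[idx, (idx + i) % n]   # i-th generalized diagonal of M
        out += diag * np.roll(v, -i)   # rotation by i slots (plaintext stand-in)
    return out


def packed_size(tables):
    """Side length of the square matrix that holds all tables (hypothetical layout)."""
    rows = sum(t.shape[0] for t in tables)
    dims = sum(t.shape[1] for t in tables)
    return max(rows, dims)


def pack_tables(tables):
    """Place each transposed table (dim x rows) on a block diagonal, zero-padded."""
    n = packed_size(tables)
    M = np.zeros((n, n))
    r = c = 0
    for t in tables:
        num_rows, dim = t.shape
        M[r:r + dim, c:c + num_rows] = t.T
        r += dim
        c += num_rows
    return M


def pack_one_hots(tables, indices):
    """Concatenate one one-hot vector per table into a single padded vector."""
    v = np.zeros(packed_size(tables))
    offset = 0
    for t, i in zip(tables, indices):
        v[offset + i] = 1.0
        offset += t.shape[0]
    return v


# Three toy tables with different row counts and embedding dimensions.
rng = np.random.default_rng(0)
tables = [rng.normal(size=(5, 2)), rng.normal(size=(3, 2)), rng.normal(size=(4, 3))]
indices = [4, 0, 2]

packed = diagonal_matvec(pack_tables(tables), pack_one_hots(tables, indices))
expected = np.concatenate([t[i] for t, i in zip(tables, indices)])
print(np.allclose(packed[:expected.size], expected))
```

In this sketch the looked-up embeddings of all tables appear concatenated in the leading slots of the result, so one pass over the diagonals serves every table at once. A real CKKS implementation would also need to account for slot counts, rotation keys, and multiplicative depth, none of which are modeled here.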
Abstract
Fully Homomorphic Encryption (FHE) allows computation directly on encrypted data and enables privacy-preserving neural inference in the cloud. Prior work has focused on models with dense inputs (e.g., CNNs), with less attention given to those with sparse inputs such as Deep Learning Recommendation Models (DLRMs). These models require encrypted lookups into large embedding tables, which are challenging to implement using FHE's restrictive operators and introduce significant overhead. In this paper, we develop performance optimizations to efficiently support sparse features and neural recommendation in FHE. First, we present an embedding compression technique using client-side digit decomposition that achieves a 77$\times$ speedup over the state of the art. Next, we propose a multi-embedding packing strategy that enables ciphertext SIMD-parallel lookups across multiple tables. We name our approach HE-LRM and integrate it into the open-source Orion FHE framework to demonstrate end-to-end encrypted DLRM inference. We evaluate HE-LRM on UCI (health prediction) and Criteo (click prediction), achieving inference latencies of 24 and 489 seconds, respectively, on a single-threaded CPU. Finally, we show how GPU and ASIC FHE acceleration can reduce end-to-end latencies to seconds and even sub-seconds, bringing encrypted recommendation close to practical.
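The client-side digit decomposition mentioned above can be illustrated with a small plaintext sketch. The client splits a sparse index into base-$b$ digits and one-hot encodes each digit, so the encoded input has length $d \cdot b$ instead of the full table size. How the server combines the digits is an assumption here: the sketch sums one row from each of $d$ small per-digit sub-tables, which is one common compositional-embedding scheme and not necessarily the construction used by HE-LRM. The function names `to_digits`, `encode_index`, and `lookup` are hypothetical.

```python
import numpy as np


def to_digits(index, base, num_digits):
    """Return the base-`base` digits of `index`, least significant first."""
    if not 0 <= index < base ** num_digits:
        raise ValueError("index out of range for the given base and digit count")
    digits = []
    for _ in range(num_digits):
        index, digit = divmod(index, base)
        digits.append(digit)
    return digits


def encode_index(index, base, num_digits):
    """Client side: one one-hot vector of length `base` per digit."""
    one_hots = np.zeros((num_digits, base))
    one_hots[np.arange(num_digits), to_digits(index, base, num_digits)] = 1.0
    return one_hots


def lookup(sub_tables, one_hots):
    """Server side (plaintext stand-in, assumed scheme): sum of per-digit rows.

    sub_tables has shape (num_digits, base, dim); one_hots has shape (num_digits, base).
    """
    return np.einsum("kb,kbd->d", one_hots, sub_tables)


base, num_digits, dim = 10, 4, 8           # covers indices 0..9999
rng = np.random.default_rng(0)
sub_tables = rng.normal(size=(num_digits, base, dim))

one_hots = encode_index(1234, base, num_digits)
embedding = lookup(sub_tables, one_hots)
print(one_hots.size, embedding.shape)      # 40 encoded slots instead of 10000
```

With base 10 and a table of $10^6$ rows, for example, the client would encode $6 \times 10 = 60$ slots rather than $10^6$. The choice of base trades the number of digits against the length of each one-hot vector.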
