Mem-Rec: Memory Efficient Recommendation System using Alternative Representation

Gopi Krishna Jha, Anthony Thomas, Nilesh Jain, Sameh Gobriel, Tajana Rosing, Ravi Iyer

TL;DR

The paper addresses the heavy memory footprint of embedding tables in DLRM-based recommender systems by introducing MEM-REC, a dual Bloom-filter encoding that compresses categorical features into two small, cache-friendly embedding paths. The final embedding is computed as $z(x) = \alpha(x) M \phi(x)$, where $\phi(x)$ is the Bloom-based token signature and $\alpha(x)$ is a data-dependent weight from a second Bloom encoder, enabling memory usage to scale logarithmically with the vocabulary size while preserving accuracy. Empirically, MEM-REC achieves iso-quality compression (no AUC loss) with up to $2904\times$ model-size reduction and up to $3.4\times$ faster embedding computations, while fitting within LLC caches and reducing data movement by large factors. This approach significantly improves the practicality of large-scale, production-grade recommender systems by alleviating memory-bandwidth bottlenecks without compromising predictive performance.
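The formula $z(x) = \alpha(x) M \phi(x)$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the hash construction (salted BLAKE2b), the weight-table size, the random initialization, and the choice to average the $k'$ weight entries into a scalar $\alpha(x)$ are all assumptions made for illustration, and the names `bloom_indices` and `MemRecSketch` are hypothetical. The token table is stored with one row per Bloom-filter position, so $M\phi(x)$ becomes a sum over the rows selected by $\phi(x)$.

```python
import hashlib
import numpy as np


def bloom_indices(token: str, num_hashes: int, table_size: int, salt: str) -> np.ndarray:
    """Map a token to `num_hashes` table positions, Bloom-filter style.

    Assumption: salted BLAKE2b stands in for whatever hash family is used.
    """
    positions = []
    for i in range(num_hashes):
        digest = hashlib.blake2b(f"{salt}:{i}:{token}".encode(), digest_size=8).digest()
        positions.append(int.from_bytes(digest, "little") % table_size)
    return np.array(positions, dtype=np.int64)


class MemRecSketch:
    """Two shared tables: a token table M and a much smaller weight table w."""

    def __init__(self, token_rows=50_000, dim=128, weight_rows=1_000,
                 k=4, k_prime=2, seed=0):
        rng = np.random.default_rng(seed)
        # Token embedding table (learned in practice; random here).
        self.M = rng.normal(scale=0.01, size=(token_rows, dim)).astype(np.float32)
        # Weight embedding table (hypothetical size and initialization).
        self.w = rng.uniform(0.5, 1.5, size=weight_rows).astype(np.float32)
        self.k, self.k_prime = k, k_prime

    def token_rows(self, token: str) -> np.ndarray:
        # phi(x) is binary, so a position hit twice still counts once.
        return np.unique(bloom_indices(token, self.k, self.M.shape[0], "token"))

    def alpha(self, token: str) -> np.float32:
        # Assumption: alpha(x) is the mean of the k' selected weight entries.
        idx = bloom_indices(token, self.k_prime, self.w.shape[0], "weight")
        return self.w[idx].mean()

    def embed(self, token: str) -> np.ndarray:
        raw = self.M[self.token_rows(token)].sum(axis=0)  # M phi(x)
        return self.alpha(token) * raw                    # alpha(x) M phi(x)


model = MemRecSketch()
z = model.embed("user_42")
print(z.shape)
```

In this sketch both table sizes are fixed up front and do not grow with the number of distinct tokens, which is the property that lets the tables stay cache-resident. Two tokens that collide on every token-table position can still receive different embeddings if their weight-table positions differ.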

Abstract

Deep learning-based recommendation systems (e.g., DLRMs) are widely used AI models that provide high-quality personalized recommendations. Training data for modern recommendation systems commonly includes categorical features that take on tens of millions of possible distinct values. These categorical tokens are typically assigned learned vector representations that are stored in large embedding tables, on the order of hundreds of GB. Storing and accessing these tables represents a substantial burden in commercial deployments. Our work proposes MEM-REC, a novel alternative representation approach for embedding tables. MEM-REC leverages Bloom filters and hashing methods to encode categorical features using two cache-friendly embedding tables. The first table (token embedding) contains raw embeddings (i.e., learned vector representations), and the second table (weight embedding), which is much smaller, contains weights that scale these raw embeddings to give each data point better discriminative capability. We provide a detailed architecture, design, and analysis of MEM-REC, addressing trade-offs in accuracy and computation requirements in comparison with state-of-the-art techniques. We show that MEM-REC not only maintains recommendation quality and significantly reduces the memory footprint of commercial-scale recommendation models, but can also improve embedding latency. In particular, based on our results, MEM-REC compresses the MLPerf CriteoTB benchmark DLRM model size by 2900x and performs embeddings up to 3.4x faster while achieving the same AUC as the full uncompressed model.

Paper Structure

This paper contains 23 sections, 2 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: DLRM Recommendation Pipeline. An overwhelming majority of the trainable parameters in DLRM come from embedding tables.
  • Figure 2: Architecture of the MEM-REC Model. Irrespective of the number of categorical features, the MEM-REC model creates only two embedding tables with size scaling just logarithmically in the alphabet size.
  • Figure 3: Sparse Feature Encoding Flow in MEM-REC. The token encoder generates raw embeddings, and the weight encoder rescales the token encoding to mitigate the effect of hash collisions.
  • Figure 4: Effect of the weight encoder on collisions and embedding latency for a Criteo-TB MEM-REC model of size $50000\times128$. The weight encoder reduces the number of unresolved collisions and helps reduce memory access latency by prioritizing frequent accesses to the feather-light weight embedding table, which fits in the L2 cache.
  • Figure 5: AUC for different values of $k$ at $k'=2$
  • ...and 3 more figures