Mixed-Precision Embeddings for Large-Scale Recommendation Models

Shiwei Li, Zhuoqi Hu, Xing Tang, Haozhao Wang, Shijie Xu, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li

TL;DR

This paper proposes a novel embedding compression method, termed Mixed-Precision Embeddings (MPE), which first groups features by frequency and then searches for a precision for each feature group, thereby reducing the size of the search space.

Abstract

Embedding techniques have become essential components of large databases in the deep learning era. By encoding discrete entities, such as words, items, or graph nodes, into continuous vector spaces, embeddings facilitate more efficient storage, retrieval, and processing in large databases. Especially in the domain of recommender systems, millions of categorical features are encoded as unique embedding vectors, which facilitates the modeling of similarities and interactions among features. However, such a large number of embedding vectors can result in significant storage overhead. In this paper, we aim to compress the embedding table through quantization techniques. Given that features vary in importance, we seek to identify an appropriate precision for each feature to balance model accuracy and memory usage. To this end, we propose a novel embedding compression method, termed Mixed-Precision Embeddings (MPE). Specifically, to reduce the size of the search space, we first group features by frequency and then search for the precision of each feature group. MPE further learns a probability distribution over precision levels for each feature group, which can be used to identify the most suitable precision with a specially designed sampling strategy. Extensive experiments on three public datasets demonstrate that MPE significantly outperforms existing embedding compression methods. Remarkably, MPE achieves about 200x compression on the Criteo dataset without compromising prediction accuracy.
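
The grouping step and the resulting storage arithmetic can be illustrated with a small sketch. This is not the authors' implementation: the helper names `group_by_frequency` and `compression_ratio`, the equal-size grouping rule, the 32-bit full-precision baseline, and the hand-picked bit-widths are all assumptions made for illustration. In MPE the bit-width of each group is searched rather than fixed by hand.

```python
from collections import Counter


def group_by_frequency(feature_ids, num_groups):
    """Rank features by descending frequency and split the ranking into contiguous groups.

    Hypothetical helper: equal-size groups are an assumption of this sketch.
    """
    counts = Counter(feature_ids)
    # Most frequent first; ties are broken by feature id so the result is deterministic.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    group_size = -(-len(ranked) // num_groups)  # ceiling division
    return {f: rank // group_size for rank, f in enumerate(ranked)}


def compression_ratio(feature_to_group, group_bits, full_bits=32):
    """Full-precision storage divided by mixed-precision storage, per embedding dimension.

    Assumes a 32-bit baseline and at least one group with a positive bit-width.
    """
    total_full = full_bits * len(feature_to_group)
    total_mixed = sum(group_bits[g] for g in feature_to_group.values())
    return total_full / total_mixed


# Toy access log: "a" occurs 8 times, "b" 4 times, "c" twice, "d" once.
log = ["a"] * 8 + ["b"] * 4 + ["c"] * 2 + ["d"]
groups = group_by_frequency(log, num_groups=2)

# Illustrative (not searched) bit-widths: 8 bits for group 0, 2 bits for group 1.
bits = {0: 8, 1: 2}
ratio = compression_ratio(groups, bits)
print(groups, ratio)
```

Searching one precision per group instead of one per feature shrinks the number of decisions from the number of features to the number of groups, which is the search-space reduction the abstract refers to.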

Paper Structure (28 sections, 11 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: The typical architecture of deep learning recommendation models (DLRMs) and the schematic diagram of quantization-aware training (QAT). (a) DLRMs are usually composed of an embedding layer and a feature interaction network. (b) In QAT, a fake quantizer is inserted into the forward pass, and end-to-end optimization is achieved through the Straight-Through Estimator (STE), which treats the quantizer as an identity map during backpropagation.
  • Figure 2: Learning process of the probability distribution over candidate bit-widths in MPE. $x$ is a feature of the input data, and $k$ is the corresponding group index when sorted by feature frequency.
  • Figure 3: AUC under varying compression ratios.
  • Figure 4: Transferability analysis. Each column contains the test AUC of a specific target model with different source models.
  • Figure 5: Inference latency of the DNN model using different compression methods.
  • ...and 1 more figures
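
The fake quantizer and Straight-Through Estimator described in the Figure 1 caption can be sketched as follows. This is a minimal illustration rather than the paper's quantizer: the uniform signed grid, the `step` parameter, and the function names `fake_quantize` and `ste_backward` are assumptions of this sketch.

```python
def fake_quantize(values, bits, step):
    """Quantize then dequantize: snap each value to a signed `bits`-bit grid with spacing `step`.

    The output stays in floating point, which is why the operation is called "fake" quantization.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = round(v / step)          # nearest grid index
        q = max(qmin, min(qmax, q))  # clip to the representable range
        out.append(q * step)         # map back to the real-valued scale
    return out


def ste_backward(grad_output):
    """Straight-Through Estimator: treat the quantizer as an identity map in the backward pass."""
    return list(grad_output)


embedding = [0.26, -0.74, 3.0]
quantized = fake_quantize(embedding, bits=4, step=0.25)  # 3.0 is clipped to index 7
upstream_grad = [0.1, -0.2, 0.3]
grad_wrt_embedding = ste_backward(upstream_grad)
print(quantized, grad_wrt_embedding)
```

Because rounding has zero gradient almost everywhere, passing the upstream gradient through unchanged is what allows the full-precision embedding to keep training. Some STE variants also zero the gradient for values outside the clipping range; that refinement is omitted here.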