Adaptive Low-Precision Training for Embeddings in Click-Through Rate Prediction

Shiwei Li, Huifeng Guo, Lu Hou, Wei Zhang, Xing Tang, Ruiming Tang, Rui Zhang, Ruixuan Li

TL;DR

The paper tackles the challenge of large embedding tables in CTR models by proposing low-precision training (LPT), which compresses embeddings during training. It provides a theoretical convergence analysis showing that stochastic rounding (SR) outperforms deterministic rounding (DR) in LPT, and introduces adaptive low-precision training (ALPT), which learns per-embedding step sizes and makes it possible to train 8-bit embeddings without accuracy loss. Empirical results on Avazu and Criteo show that ALPT compresses embeddings without degrading accuracy relative to the full-precision (FP) baseline and outperforms quantization-aware training (QAT) baselines, and that it scales well across bit widths, embedding dimensions, and feature counts. The work has practical implications for training-efficient CTR systems, and the code is publicly available.

Abstract

Embedding tables are usually huge in click-through rate (CTR) prediction models. To train and deploy the CTR models efficiently and economically, it is necessary to compress their embedding tables at the training stage. To this end, we formulate a novel quantization training paradigm to compress the embeddings from the training stage, termed low-precision training (LPT). Also, we provide theoretical analysis on its convergence. The results show that stochastic weight quantization has a faster convergence rate and a smaller convergence error than deterministic weight quantization in LPT. Further, to reduce the accuracy degradation, we propose adaptive low-precision training (ALPT) that learns the step size (i.e., the quantization resolution) through gradient descent. Experiments on two real-world datasets confirm our analysis and show that ALPT can significantly improve the prediction accuracy, especially at extremely low bit widths. For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy. The code of ALPT is publicly available.
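
The two weight quantizers compared in the abstract can be illustrated with a minimal sketch. It assumes a uniform symmetric grid with step size `delta` and signed `bits`-bit integer codes; the function names (`quantize_dr`, `quantize_sr`, `int_range`) and the clipping range are illustrative choices, not taken from the paper's released code.

```python
import math
import random

def int_range(bits):
    """Signed integer range for a given bit width (assumed symmetric scheme)."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def quantize_dr(w, delta, bits=8):
    """Deterministic rounding: snap w to the nearest multiple of delta."""
    qmin, qmax = int_range(bits)
    q = math.floor(w / delta + 0.5)
    return max(qmin, min(qmax, q)) * delta

def quantize_sr(w, delta, bits=8, rng=random):
    """Stochastic rounding: round up with probability equal to the
    fractional part of w / delta, so the result is unbiased before clipping."""
    qmin, qmax = int_range(bits)
    scaled = w / delta
    low = math.floor(scaled)
    q = low + (1 if rng.random() < scaled - low else 0)
    return max(qmin, min(qmax, q)) * delta

# A value 30% of the way between two grid points.
rng = random.Random(0)
w, delta = 0.3, 1.0
samples = [quantize_sr(w, delta, rng=rng) for _ in range(100_000)]
print(quantize_dr(w, delta))        # DR always returns the same grid point
print(sum(samples) / len(samples))  # SR averages out close to w
```

Deterministic rounding always maps `0.3` to the same grid point, while stochastic rounding returns either neighbor and matches the original value in expectation.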

Paper Structure

This paper contains 30 sections, 4 theorems, 37 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

[Theorem 2 in Li et al., 2017] Assume the learning rate decays like $\eta^t = \frac{\eta}{\sqrt{t}}$. At the $T$-th iteration, with a fixed step size $\Delta$, for stochastic rounding (SR) in LPT, we have:

Figures (4)

  • Figure 1: The embedding table and neural network paradigm of click-through rate prediction models.
  • Figure 2: Training processes of quantization-aware training (QAT) and low-precision training (LPT).
  • Figure 3: (a), (b) and (c) plot the distributions of the parameters. FP stands for training with full-precision parameters; DR and SR stand for training with deterministic rounding and stochastic rounding in LPT, respectively. (d) plots the number of parameters that satisfy $|\eta^t \nabla f(w^{t})| < \frac{\Delta}{2}$ at different iterations of DR.
  • Figure 4: AUC under different learning rates and gradient scaling factors of the step size.
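
The condition $|\eta^t \nabla f(w^{t})| < \frac{\Delta}{2}$ counted in Figure 3(d) marks updates that are too small to move a weight to the next grid point under DR. The toy sketch below illustrates that effect. It assumes a single weight, a quadratic loss $f(w) = \frac{1}{2}(w - \text{target})^2$, and a constant learning rate; these choices, along with the names `round_dr`, `round_sr`, and `train`, are hypothetical and are not the paper's experimental setup.

```python
import math
import random

def round_dr(x):
    """Round to the nearest integer code."""
    return math.floor(x + 0.5)

def round_sr(x, rng):
    """Round up with probability equal to the fractional part of x."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)

def train(rounding, steps=2000, delta=0.1, lr=0.01, target=1.0, seed=0):
    """Low-precision training of one weight: only the integer code q is
    stored, and the weight is re-quantized after every gradient step."""
    rng = random.Random(seed)
    q = 0
    for _ in range(steps):
        w = q * delta                 # de-quantize
        grad = w - target             # gradient of 0.5 * (w - target)^2
        scaled = (w - lr * grad) / delta
        q = round_dr(scaled) if rounding == "dr" else round_sr(scaled, rng)
    return q * delta

# At w = 0 the update is lr * |grad| = 0.01, below delta / 2 = 0.05.
print("DR:", train("dr"))  # the update is rounded away at every step
print("SR:", train("sr"))  # the weight still drifts toward the target
```

In this toy setting DR leaves the weight at its initial value because every update is rounded back to the same grid point, whereas SR applies each update with a probability proportional to its size and so still makes progress on average.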

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Remark 1
  • Remark 2
  • Lemma 1
  • Lemma 2