Accelerating Recommendation System Training by Leveraging Popular Choices

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

TL;DR

FAE addresses the bottleneck of large embedding tables in recommender training by exploiting highly skewed access patterns to keep hot embeddings on GPUs and cold embeddings on CPUs. A static preprocessing phase estimates a hot embedding threshold using input sampling and a statistical optimizer, while a runtime Shuffle Scheduler interleaves hot and cold mini-batches to preserve convergence. The framework is validated on real-world models (DLRM, TBSM) across multiple datasets, achieving up to 2.3× speedups over CPU-only baselines and substantial reductions in CPU-GPU data transfer with preserved accuracy. This work demonstrates a practical, memory-efficient path to GPU-accelerated training for large-scale recommenders, with notable gains in throughput and energy efficiency.

Abstract

Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models requires ever more data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper examines the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout uses the scarce GPU memory to store the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
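The hot/cold split described in the abstract can be illustrated with a minimal sketch. It assumes a single embedding table, exact access counts rather than the sampling-based estimate FAE uses, and an input that counts as hot only when every embedding it touches is hot. The function name `classify_hot` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter


def classify_hot(inputs, threshold):
    """Split embedding ids and inputs into hot and cold.

    inputs:    list of tuples, each holding the embedding ids one training
               input accesses (single table, an assumption for brevity).
    threshold: minimum access count for an embedding id to count as hot
               (assumed definition; the paper derives its threshold
               statistically from sampled inputs).
    """
    # Count how often each embedding id is accessed across all inputs.
    counts = Counter(idx for sample in inputs for idx in sample)
    hot_ids = {idx for idx, c in counts.items() if c >= threshold}

    hot_inputs, cold_inputs = [], []
    for sample in inputs:
        # An input is hot only if all of its embedding ids are hot.
        if all(idx in hot_ids for idx in sample):
            hot_inputs.append(sample)
        else:
            cold_inputs.append(sample)
    return hot_ids, hot_inputs, cold_inputs


# Toy access trace: ids 0 and 1 are popular, ids 2 and 3 are rare.
samples = [(0, 1), (0, 1), (0, 2), (1, 0), (3, 0), (0, 1)]
hot_ids, hot_inputs, cold_inputs = classify_hot(samples, threshold=3)
print(sorted(hot_ids), len(hot_inputs), len(cold_inputs))
```

In this toy trace, two of the four embedding ids are hot and they serve four of the six inputs. Lowering the threshold admits more embeddings into the hot set, which is the trade-off between GPU memory use and hot-input coverage that the threshold selection has to balance.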

Paper Structure

This paper contains 27 sections, 5 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: (a) Typical recommender models (DLRM, NeuralCF, TBSM). They comprise compute-intensive neural networks like DNNs and MLPs in tandem with the memory-intensive embedding tables. (b) Embedding table sizes for four real-world datasets and the proportion of each embedding table that is frequently accessed (hot). The graph also shows the percentage of training inputs that access only the hot embeddings. (c) The baseline embedding data layout, i.e., storing the embeddings entirely in main memory. (d) The proposed layout, where hot embeddings that cater to >70% of the training inputs are stored locally on GPUs.
  • Figure 2: Execution graph of a deep-learning-based recommender model. The graph shows the forward pass in detail; the backward pass mirrors it and executes on the CPU or GPU according to its forward counterpart. The current mode of training for DLRM and TBSM requires embedding storage, reading, and processing on the CPU.
  • Figure 3: Probability of creating a mini-batch consisting entirely of popular (hot) inputs when hot inputs make up 99% or less of the dataset. This probability drops drastically as the mini-batch size increases.
  • Figure 4: The FAE framework. The pre-processing phase calculates the threshold for classifying hot embeddings. This phase uses random sampling of input datasets and embedding tables to determine the best threshold for hot embeddings. This threshold is also used to classify inputs into hot and cold mini-batches. At runtime, GPUs execute the hot input mini-batches while cold inputs execute in a CPU-GPU hybrid mode. The Shuffle Scheduler uses feedback from the PyTorch modules to determine the rate at which hot and cold mini-batches are swapped.
  • Figure 5: (a) Size of hot embedding entries and (b) percentage of hot inputs for varying access threshold values. As we vary the threshold, the size of the hot embedding entries increases more rapidly than the percentage of hot inputs.
  • ...and 11 more figures
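
The trend described in the Figure 3 caption can be reproduced with a short calculation. This sketch assumes that each input in a mini-batch is drawn independently and is hot with probability p, so a mini-batch of size B is all-hot with probability p^B. The independence assumption and the function name are illustrative and are not taken from the paper.

```python
def all_hot_batch_probability(hot_fraction, batch_size):
    """Probability that a randomly drawn mini-batch contains only hot inputs.

    Assumes each input is sampled independently and is hot with
    probability `hot_fraction` (an illustrative assumption).
    """
    return hot_fraction ** batch_size


# Even with 99% hot inputs, purely hot mini-batches become rare quickly.
for batch_size in (1, 64, 1024):
    p = all_hot_batch_probability(0.99, batch_size)
    print(f"batch size {batch_size:5d}: P(all hot) = {p:.6f}")
```

Under this assumption, random batching almost never produces an all-hot mini-batch at common batch sizes, which is why the inputs are grouped into hot and cold mini-batches ahead of time rather than left to chance.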