Accelerating Recommender Model Training by Dynamically Skipping Stale Embeddings

Yassaman Ebrahimzadeh Maboud, Muhammad Adnan, Divya Mahajan, Prashant J. Nair

TL;DR

Slipstream is a runtime framework that accelerates training of large-scale recommender models by dynamically skipping updates to stale embeddings. It uses a three-stage approach (snapshotting hot embeddings, sampling to identify a skip threshold, and an input classifier that omits updates for stale inputs), augmented with feature normalization to recover accuracy. Across four public datasets and standard recommender models, Slipstream achieves about $2\times$ to $2.5\times$ training-time speedups with minor to positive accuracy changes and low overhead, and it remains complementary to hardware accelerators like Hotline. The work offers a practical, data-aware method to reduce CPU-GPU bandwidth and memory traffic in commercial settings, potentially enabling higher throughput in production recommender systems.

Abstract

Training recommendation models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings do not contribute to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DLRM, FAE, and Hotline, respectively.
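
The idea of identifying stale embeddings and skipping their updates can be illustrated with a minimal sketch. This is not Slipstream's implementation: the function names, the L2 distance metric, the fixed threshold value, and the rule "skip an input only if every embedding it touches is stale" are all assumptions made for illustration.

```python
import math

def l2_change(old, new):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))

def find_stale(snapshot, current, threshold):
    """Return ids of embeddings that moved less than `threshold` since the snapshot."""
    return {i for i, old in snapshot.items()
            if l2_change(old, current[i]) < threshold}

def should_skip(input_ids, stale_ids):
    """Hypothetical rule: skip an input only if all embeddings it accesses are stale."""
    return all(i in stale_ids for i in input_ids)

# Toy example: three hot embeddings, snapshotted earlier and observed now.
snapshot = {0: [0.10, 0.20], 1: [0.50, 0.50], 2: [0.0, 0.0]}
current = {0: [0.10, 0.21], 1: [0.90, 0.10], 2: [0.0, 0.001]}

stale = find_stale(snapshot, current, threshold=0.05)
skip_a = should_skip([0, 2], stale)  # touches only barely-changed embeddings
skip_b = should_skip([0, 1], stale)  # touches embedding 1, which is still changing
```

In this toy setup, embeddings 0 and 2 have barely moved since the snapshot, so an input that touches only those two would be skipped, while an input that also touches embedding 1 would still be trained on.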

Paper Structure (45 sections, 8 equations, 13 figures, 8 tables, 1 algorithm)

Figures (13)

  • Figure 1: The Deep Learning Recommendation Model (DLRM) consists of compute-intensive Multi-Layer Perceptrons (MLPs) and memory-intensive embedding lookup operations. Due to the large embedding tables and skewed accesses, numerous embedding entries are rapidly trained and remain stagnant throughout the training process.
  • Figure 2: The breakdown of the training time for an Intel-optimized DLRM with 4 GPUs in a hybrid CPU-GPU training setup. We observe that a significant fraction of the time is spent on the forward embedding pass, embedding updates in the optimizer, and communication.
  • Figure 3: Access frequency to the largest embedding table during a single training epoch. This skewed access categorizes embeddings into 'hot' and 'cold.' The x-axis shows embedding indices in millions.
  • Figure 4: The temporal difference in values for ten randomly selected 'hot' embeddings for RM2 (Criteo Kaggle), RM3 (Criteo Terabyte), and RM4 (Avazu) recommendation models. As 'hot' embeddings account for a significant fraction of accesses, they tend to saturate quickly, in under 25% of the training iterations. This experiment uses DLRM for the training process.
  • Figure 5: Impact on testing accuracy when completely skipping cold or hot embedding updates compared to a baseline DLRM implementation. This representative analysis uses RM2 (Criteo Kaggle) and RM3 (Criteo Terabyte). We observe that a naive approach of skipping all 'hot' or all 'cold' embedding updates can cause a significant accuracy loss of 4-6%.
  • ...and 8 more figures