Accelerating Recommender Model Training by Dynamically Skipping Stale Embeddings

Yassaman Ebrahimzadeh Maboud; Muhammad Adnan; Divya Mahajan; Prashant J. Nair

TL;DR

Slipstream is a runtime framework that accelerates training of large-scale recommender models by dynamically skipping updates to stale embeddings. It works in three stages: snapshotting hot embeddings, sampling to identify a skip threshold, and classifying inputs so that updates from stale inputs are omitted. Feature normalization is added to recover accuracy. Across four public datasets and standard recommender models, Slipstream achieves training-time speedups of about $2\times$ to $2.5\times$ with minor to positive accuracy changes and low overhead, and it remains complementary to hardware accelerators like Hotline. The work offers a practical, data-aware method to reduce CPU-GPU bandwidth and memory traffic in commercial settings, potentially enabling higher throughput in production recommender systems.
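
The three stages can be illustrated with a minimal Python sketch. This is not the authors' implementation; it assumes that staleness is measured as the L2 distance between a snapshot of an embedding row and its current value, that the skip threshold is taken as a quantile of sampled changes, and that an input is skipped only when every embedding it touches is stale. The function names (`l2_change`, `pick_threshold`, `find_stale`, `should_skip_input`) and the `skip_fraction` parameter are hypothetical, and feature normalization is not modeled.

```python
from typing import Dict, Iterable, List, Sequence, Set


def l2_change(old: Sequence[float], new: Sequence[float]) -> float:
    """L2 distance between a snapshotted embedding row and its current value."""
    return sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5


def pick_threshold(sampled_changes: Iterable[float], skip_fraction: float) -> float:
    """Assumed sampling step: return a value such that roughly
    `skip_fraction` of the sampled changes fall strictly below it."""
    ordered = sorted(sampled_changes)
    if not ordered:
        raise ValueError("need at least one sampled change")
    k = min(int(len(ordered) * skip_fraction), len(ordered) - 1)
    return ordered[k]


def find_stale(
    snapshot: Dict[int, List[float]],
    current: Dict[int, List[float]],
    threshold: float,
) -> Set[int]:
    """Indices of hot embeddings whose change since the snapshot is below threshold."""
    return {
        idx
        for idx, old in snapshot.items()
        if l2_change(old, current[idx]) < threshold
    }


def should_skip_input(indices: Iterable[int], stale: Set[int]) -> bool:
    """Assumed classifier rule: skip an input only if all of its embeddings are stale."""
    indices = list(indices)
    return bool(indices) and all(i in stale for i in indices)


# Toy example: rows 0 and 2 barely moved since the snapshot, row 1 moved a lot.
snapshot = {0: [1.0, 1.0], 1: [0.0, 0.0], 2: [0.5, 0.5]}
current = {0: [1.0, 1.001], 1: [0.3, 0.4], 2: [0.5, 0.5]}
stale = find_stale(snapshot, current, threshold=0.01)
```

In this toy setup, an input that looks up only rows 0 and 2 would be skipped, while an input that also looks up row 1 would still be trained on. Requiring every accessed embedding to be stale is a conservative choice made here for illustration.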

Abstract

Training recommendation models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings contribute nothing to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DLRM, FAE, and Hotline, respectively.

Paper Structure (45 sections, 8 equations, 13 figures, 8 tables, 1 algorithm)

Introduction
Background and Challenges
An Overview of Recommendation Systems
Training Setup
Motivation: Data-Aware Embedding Updates
Breakdown of Training Time
'Hot' Embeddings and Skewed Access Patterns
Embedding Value Saturation
Challenges: Identification of Stale Embeddings
Capturing Embedding Variations
Determining Which Updates to Skip
Design: The Slipstream Framework
Efficient Snapshots with 'Hot' Embeddings
Warmup Period
Embedding Snapshots
...and 30 more sections

Figures (13)

Figure 1: The Deep Learning Recommendation Model (DLRM) consists of compute-intensive Multi-Layer Perceptrons (MLPs) and memory-intensive embedding lookup operations. Due to the large embedding tables and skewed accesses, numerous embedding entries are rapidly trained and remain stagnant throughout the training process.
Figure 2: The breakdown of the training time for an Intel-optimized DLRM with 4 GPUs in a hybrid CPU-GPU training setup. We observe that a significant fraction of the time is spent on the forward embedding pass, embedding updates in the optimizer, and communication.
Figure 3: Access frequency to the largest embedding table during a single training epoch. This skewed access categorizes embeddings into 'hot' and 'cold.' The x-axis shows embedding indices in millions.
Figure 4: The temporal difference in values for ten randomly selected 'hot' embeddings for RM2 (Criteo Kaggle), RM3 (Criteo Terabyte), and RM4 (Avazu) recommendation models. As 'hot' embeddings account for a significant fraction of accesses, they tend to saturate quickly -- in under 25% of the training iterations. This experiment uses DLRM for the training process.
Figure 5: Impact on testing accuracy when completely skipping cold or hot embedding updates compared to a baseline DLRM implementation. This representative analysis uses RM2 (Criteo Kaggle) and RM3 (Criteo Terabyte). Thus, we observe that a naive approach of skipping 'hot' or 'cold' embeddings can cause a significant accuracy loss of 4-6%.
...and 8 more figures
