Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory

Jie Ren, Bin Ma, Shuangyan Yang, Benjamin Francis, Ehsan K. Ardestani, Min Si, Dong Li

TL;DR

This work tackles embedding-table memory pressure in industrial DLRM inference on tiered memory by introducing RecMG, a dual-ML system that learns both caching for temporal locality and prefetching for irregular accesses. It uses offline Belady-based ground truth and a differentiable Chamfer-distance loss to train the two models, achieving substantial reductions in on-demand fetches and end-to-end latency (up to ~43%). RecMG significantly outperforms state-of-the-art rule-based and ML-based prefetchers as well as LRU caching in production-like traces, without requiring changes to the DLRM model itself. The approach demonstrates practical ML-guided memory tiering for large embedding workloads, offering scalable improvements for high-cardinality categorical features in real-world recommender systems.
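
To make "offline Belady-based ground truth" concrete, here is a minimal sketch of how per-access hit/miss labels could be derived from a recorded trace with Belady's MIN policy (evict the resident item whose next use is furthest in the future). The function name `belady_labels`, the always-insert-on-miss behavior, and the toy trace are illustrative assumptions; the labeling procedure RecMG actually uses is not described in this section.

```python
def belady_labels(trace, capacity):
    """Simulate Belady's MIN on a trace of vector ids (hypothetical helper).

    Returns a list of booleans: True where the access hits in the cache.
    """
    INF = float("inf")
    # next_use[i] = position of the next access to trace[i], or INF if none.
    next_use = [INF] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], INF)
        last_seen[trace[i]] = i

    cache = {}  # vector id -> position of its next use
    hits = []
    for i, vec in enumerate(trace):
        if vec in cache:
            hits.append(True)
        else:
            hits.append(False)
            if len(cache) >= capacity:
                # Evict the resident vector reused furthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        # Assumption: every accessed vector is inserted (no bypass).
        cache[vec] = next_use[i]
    return hits


trace = [1, 2, 3, 1, 2, 4, 1, 2]
labels = belady_labels(trace, capacity=2)
print(sum(labels), "hits out of", len(trace))
```

Because the policy needs the full future trace, labels like these can only be computed offline, which is consistent with their use as training targets rather than as a runtime policy.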

Abstract

Deep learning recommendation models (DLRMs) are widely used in industry, and their memory capacity requirements reach the terabyte scale. Tiered memory architectures provide a cost-effective solution but introduce challenges in embedding-vector placement due to complex embedding-access patterns. We propose RecMG, a machine learning (ML)-guided system for vector caching and prefetching on tiered memory. RecMG accurately predicts accesses to embedding vectors with long reuse distances or few reuses. The design of RecMG focuses on making ML feasible in the context of DLRM inference by addressing unique challenges in data labeling and navigating the search space for embedding-vector placement. By employing separate ML models for caching and prefetching, plus a novel differentiable loss function, RecMG narrows the prefetching search space and minimizes on-demand fetches. Compared to state-of-the-art temporal, spatial, and ML-based prefetchers, RecMG reduces on-demand fetches by 2.2x, 2.8x, and 1.5x, respectively. In industrial-scale DLRM inference scenarios, RecMG effectively reduces end-to-end DLRM inference time by up to 43%.
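
The TL;DR names the differentiable loss as a Chamfer distance. As a rough illustration only, the sketch below computes the standard symmetric Chamfer distance between a set of predicted points and a set of target points, using squared Euclidean distance. The function name `chamfer_distance`, the point representation, and the averaging are assumptions; the exact formulation used in RecMG is not given in this section.

```python
def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two non-empty point sets.

    pred, target: lists of equal-dimension points (lists or tuples of floats).
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_way(src, dst):
        # Mean distance from each point in src to its nearest point in dst.
        return sum(min(sq_dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


pred = [(0.0, 0.0), (1.0, 0.0)]
target = [(0.0, 0.0), (1.0, 1.0)]
print(chamfer_distance(pred, target))  # 1.0
```

The distance ignores ordering within each set, which suits prefetching: a prediction is useful if the predicted vectors are accessed soon, regardless of the exact order. In a training setting the same computation would be written with an autodiff framework so gradients flow through the nearest-neighbor terms.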

Paper Structure

This paper contains 20 sections, 5 equations, 19 figures, 6 tables, 2 algorithms.

Figures (19)

  • Figure 1: DLRM architecture on tiered memory. Embeddings map the categorical features into dense representations.
  • Figure 2: Embedding tables and pooling factor.
  • Figure 3: Reuse distance of embedding-vector accesses in 856 sparse features.
  • Figure 4: Design overview of RecMG.
  • Figure 5: The architecture of the (a) caching and (b) prefetch models. The dashed rectangle represents one LSTM stack. "E" and "D" stand for the LSTM encoder and decoder, respectively.
  • ...and 14 more figures