Disaggregating Embedding Recommendation Systems with FlexEMR

Yibo Huang, Zhenning Yang, Jiarong Xing, Yi Dai, Yiming Qiu, Dingming Wu, Fan Lai, Ang Chen

TL;DR

This work addresses the memory and cost inefficiencies of embedding-based recommendation (EMR) models by proposing FlexEMR, a disaggregated EMR serving system that separates embedding storage from NN compute. It introduces two core strategies: locality-enhanced disaggregation (adaptive EMB caching and hierarchical pooling) to minimize network traffic and GPU contention, and a multi-threaded, mapping-aware RDMA engine with live migration and credit-based flow control to accelerate remote lookups. The paper presents a system design and early prototype results showing improved throughput and reduced tail latency, suggesting significant potential for better resource utilization and lower TCO in large-scale EMR deployments. If proven scalable, FlexEMR could transform EMR serving and inspire similar disaggregation approaches for other large-scale ML workloads, including LLMs and MoE models.

Abstract

Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: Leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.

Paper Structure

This paper contains 14 sections, 8 figures.

Figures (8)

  • Figure 1: A representative EMR model: the Deep Learning Recommendation Model (DLRM).
  • Figure 2: Embedding layer dominates EMR serving.
  • Figure 3: FlexEMR architecture overview.
  • Figure 4: Hierarchical EMB pooling. When pooling is handled solely by the ranker, every looked-up embedding must cross the network, which can cause network contention (Figure 4a). Pooling hierarchically, so that only intermediate results are sent to the ranker, can reduce network traffic (Figure 4b).
  • Figure 5: Distribution of inference workloads in Alibaba PAI platform over one week.
  • ...and 3 more figures
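
The hierarchical pooling idea described in the Figure 4 caption can be illustrated with a small sketch. It is not FlexEMR's implementation: it assumes sum pooling (the summary does not name the pooling operator), models each embedding server as a plain dictionary, and uses hypothetical names such as `emb_server_0`, `ranker_side_pooling`, and `hierarchical_pooling`. The point is only that an associative pooling operator gives the same result either way, while the number of vectors sent to the ranker drops from one per lookup to one per embedding server.

```python
from collections import defaultdict


def sum_pool(vectors, dim):
    """Element-wise sum of equally sized vectors (sum pooling is an assumption)."""
    out = [0.0] * dim
    for vec in vectors:
        for i, x in enumerate(vec):
            out[i] += x
    return out


def ranker_side_pooling(shards, lookups, dim):
    """Baseline: every looked-up embedding is sent to the ranker, which pools them."""
    sent = [shards[server][idx] for server, idx in lookups]
    return sum_pool(sent, dim), len(sent)


def hierarchical_pooling(shards, lookups, dim):
    """Each server pools its own rows first and sends a single partial result."""
    by_server = defaultdict(list)
    for server, idx in lookups:
        by_server[server].append(shards[server][idx])
    partials = [sum_pool(vecs, dim) for vecs in by_server.values()]
    return sum_pool(partials, dim), len(partials)


# Hypothetical toy shards: server name -> {row index -> embedding vector}.
shards = {
    "emb_server_0": {0: [1.0, 2.0], 1: [3.0, 4.0]},
    "emb_server_1": {7: [5.0, 6.0], 9: [7.0, 8.0]},
}
lookups = [
    ("emb_server_0", 0),
    ("emb_server_0", 1),
    ("emb_server_1", 7),
    ("emb_server_1", 9),
]

flat_result, flat_vectors_sent = ranker_side_pooling(shards, lookups, dim=2)
hier_result, hier_vectors_sent = hierarchical_pooling(shards, lookups, dim=2)
```

In this toy setup both paths produce the same pooled vector, but the hierarchical path sends two vectors to the ranker instead of four. The saving grows with the number of lookups that land on the same server.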