ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models

Yujeong Choi, Jiin Kim, Minsoo Rhu

TL;DR

ElasticRec addresses the inefficiency of model-wise resource allocation in RecSys serving with a microservice-based architecture that partitions models into fine-grained shards and allocates embeddings by utility, guided by DP-based table partitioning. This design enables selective replication of hot embeddings and bottlenecked dense layers, yielding on average a $3.3\times$ reduction in memory allocation size, an $8.1\times$ increase in memory utility, and a $1.6\times$ reduction in deployment cost across CPU-only and CPU-GPU setups. The approach uses bucketization and Kubernetes autoscaling to adapt to dynamic query traffic, providing practical elasticity for large embedding tables and heterogeneous compute demands. The results indicate that ElasticRec improves fleet-wide QPS and reduces memory footprint, making elastic, cost-effective RecSys serving feasible at scale.

Abstract

With the increasing popularity of recommendation systems (RecSys), the demand for compute resources in datacenters has surged. However, the model-wise resource allocation employed in current RecSys model serving architectures falls short of utilizing resources effectively, leading to a sub-optimal total cost of ownership. We propose ElasticRec, a model serving architecture for RecSys that provides resource elasticity and high memory efficiency. ElasticRec is built on a microservice-based software architecture for fine-grained resource allocation, tailored to the heterogeneous resource demands of RecSys. Additionally, ElasticRec achieves high memory efficiency via our utility-based resource allocation. Overall, ElasticRec achieves an average 3.3x reduction in memory allocation size and an 8.1x increase in memory utility, resulting in an average 1.6x reduction in deployment cost compared to a state-of-the-art RecSys inference serving system.

Paper Structure (25 sections, 20 figures, 2 tables, 2 algorithms)

Figures (20)

  • Figure 1: A modern DNN-based RecSys model architecture.
  • Figure 2: (a) A containerized ML inference server using model-wise resource allocation, and (b) using Kubernetes to scale out multiple server replicas across the datacenter to meet a target QPS goal.
  • Figure 3: The fraction of (a) FLOPs and memory consumption and (b) end-to-end inference latency (over CPU-only and CPU-GPU systems) that the sparse embedding and dense DNN layers account for, evaluated over the three models studied in this paper (RM1, RM2, and RM3). FLOPs and memory consumption are architecture-independent, so their values are identical over CPU-only and CPU-GPU systems. Section \ref{sect:methodology} details our methodology. The FLOPs percentages of the sparse embedding layers in (a) are 2%, 1%, and 0.1% for RM1, RM2, and RM3, respectively. The memory consumption percentages of the dense DNN layers in (a) are 0.02%, 0.02%, and 0.4% for RM1, RM2, and RM3, respectively.
  • Figure 4: An example RecSys where the dense DNN layer achieves half the QPS of the sparse embedding layer. (a) How the baseline model-wise resource allocation would replicate two servers to reach $100$ queries/sec and (b) how our proposed ElasticRec would reach the same QPS goal using fine-grained, per-layer resource allocation.
  • Figure 5: Service throughput (QPS) of dense DNN and sparse embedding layers over (a) CPU-only and (b) CPU-GPU systems when measured separately over the three RecSys models used in our evaluation (see Table \ref{tab:benchmark}). As shown, due to the heterogeneous resource demands of RecSys, a significant QPS mismatch exists between sparse and dense layers on both CPU-only and CPU-GPU systems.
  • ...and 15 more figures
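
The replica arithmetic behind the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the paper's allocation algorithm: the per-layer throughputs (50 and 100 queries/sec) are inferred from the caption's "half the QPS" and "two servers to reach 100 queries/sec", while the per-replica memory sizes and all function names are hypothetical values chosen only to show the effect.

```python
import math

def model_wise_replicas(layer_qps, target_qps):
    """Baseline: every layer lives in one server, so the whole server is
    replicated and its throughput is capped by the slowest layer."""
    server_qps = min(layer_qps.values())
    n = math.ceil(target_qps / server_qps)
    return {name: n for name in layer_qps}

def per_layer_replicas(layer_qps, target_qps):
    """Fine-grained: each layer is its own service and is replicated
    only as many times as its own throughput requires."""
    return {name: math.ceil(target_qps / qps) for name, qps in layer_qps.items()}

def total_memory(replicas, layer_mem_gb):
    """Sum memory over all replicas of all layers."""
    return sum(replicas[name] * layer_mem_gb[name] for name in replicas)

# Throughputs inferred from the Figure 4 caption (dense is half of sparse).
layer_qps = {"dense": 50, "sparse": 100}
# Hypothetical per-replica memory footprints in GB (assumption, not from the paper).
layer_mem_gb = {"dense": 1, "sparse": 100}
target_qps = 100

baseline = model_wise_replicas(layer_qps, target_qps)
elastic = per_layer_replicas(layer_qps, target_qps)

print("model-wise:", baseline, total_memory(baseline, layer_mem_gb), "GB")
print("per-layer: ", elastic, total_memory(elastic, layer_mem_gb), "GB")
```

Under these assumed numbers, the model-wise baseline duplicates the memory-heavy sparse layer even though only the dense layer is the bottleneck, whereas per-layer allocation adds a replica only where throughput is short.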