Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

Rishabh Jain, Vivek M. Bhasi, Adwait Jog, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das

TL;DR

This paper shows that the embedding stage remains the primary bottleneck in the GPU inference pipeline, causing up to a 3.2x embedding-only performance slowdown, and proposes plug-and-play software prefetching and L2 pinning techniques that hide and reduce memory latencies.

Abstract

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and raising inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to up to a 3.2x embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, improving performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
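To make the embedding stage concrete, the following is a minimal sketch of a pooled embedding lookup in plain Python. It is not the paper's kernel: the function name `embedding_bag_sum`, the sum pooling, and the `indices`/`offsets` layout are assumptions chosen for illustration. The point it shows is that each output element is a short sum over table rows picked by data-dependent indices, so the work is dominated by irregular memory reads rather than arithmetic.

```python
def embedding_bag_sum(table, indices, offsets):
    """Sum-pool rows of `table` per bag (hypothetical helper, not the paper's code).

    table   : list of rows, each a list of floats (one row per categorical ID)
    indices : flat list of row IDs for all bags, concatenated
    offsets : start position of each bag inside `indices`
    """
    dim = len(table[0])
    num_bags = len(offsets)
    out = [[0.0] * dim for _ in range(num_bags)]
    for bag in range(num_bags):
        start = offsets[bag]
        end = offsets[bag + 1] if bag + 1 < num_bags else len(indices)
        # On a GPU, each (bag, d) pair could be assigned to its own thread.
        for d in range(dim):
            acc = 0.0
            for pos in range(start, end):
                # Data-dependent gather: the row read depends on indices[pos].
                acc += table[indices[pos]][d]
            out[bag][d] = acc
    return out


table = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
pooled = embedding_bag_sum(table, indices=[0, 2, 1], offsets=[0, 2])
print(pooled)  # [[101.0, 202.0], [10.0, 20.0]]
```

The first bag pools rows 0 and 2, and the second bag contains only row 1. How often the same rows recur across bags is what determines whether these reads are served from cache or from device memory.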

Paper Structure

This paper contains 41 sections, 19 figures, 9 tables, 1 algorithm.

Figures (19)

  • Figure 1: Shown is the degradation in inference performance as hotness lowers (working footprint decreases) from left to right. The numbers inside the bars indicate the embedding stage contributions. Here, OptMT provides higher WLP which enhances performance over off-the-shelf PyTorch (base). Yet, a significant gap continues to exist compared to the fastest loads (one item case). We cite this as the research gap.
  • Figure 2: A schematic of a DLRM architecture. The continuous features (e.g., age, location) are processed by the Bottom MLP, and categorical features (e.g., movie genre, item ID) by the Embedding Stage. Their outputs are combined in the Feature Interaction Stage, and then fed into the Top MLP, which predicts the top-k items with the highest Click Through Rate (CTR).
  • Figure 3: Simplified Nvidia A100 GPU organization.
  • Figure 4: Parallel implementation of the embedding stage by work partitioning across CUDA threads. Here, thousands of CUDA threads each independently compute one output matrix element.
  • Figure 5: Coverage study for different memory access patterns: it shows the % of total accesses (y axis) that are covered by the % of unique accesses (x axis).
  • ...and 14 more figures
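
The coverage measure described in the Figure 5 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function name `coverage_curve` is hypothetical, and it assumes that unique items are ranked from most to least frequently accessed before accumulating their share of total accesses.

```python
from collections import Counter


def coverage_curve(accesses):
    """Return (% of unique items, % of total accesses covered) points.

    Assumption: unique items are ordered by access count, hottest first.
    """
    counts = sorted(Counter(accesses).values(), reverse=True)
    total = len(accesses)
    unique = len(counts)
    points = []
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        points.append((100.0 * rank / unique, 100.0 * covered / total))
    return points


# Toy trace of embedding row IDs: row 7 is "hot", the others are rarely touched.
trace = [7, 7, 7, 7, 7, 7, 3, 3, 9, 4]
curve = coverage_curve(trace)
print(curve)  # [(25.0, 60.0), (50.0, 80.0), (75.0, 90.0), (100.0, 100.0)]
```

In this toy trace, the hottest 25% of unique rows account for 60% of all accesses. A curve that rises steeply like this indicates a small working footprint, while a near-diagonal curve indicates little reuse.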