PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models

Yunjae Lee, Hyeseong Kim, Minsoo Rhu

TL;DR

PreSto is a storage-centric preprocessing system that uses In-Storage Processing (ISP) to offload the bottlenecked preprocessing operations to the authors' ISP units. It outperforms the baseline CPU-centric system with a 9.6× speedup in end-to-end preprocessing time.

Abstract

Training recommendation systems (RecSys) is challenging because the "data preprocessing" stage must process a large amount of raw data and feed it to the GPU seamlessly. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing, which incurs substantial deployment cost and power consumption. Our characterization reveals that prior CPU-centric preprocessing is bottlenecked on feature generation and feature normalization operations because it fails to exploit the abundant inter-/intra-feature parallelism in RecSys preprocessing. PreSto is a storage-centric preprocessing system leveraging In-Storage Processing (ISP), which offloads the bottlenecked preprocessing operations to our ISP units. We show that PreSto outperforms the baseline CPU-centric system with a $9.6\times$ speedup in end-to-end preprocessing time, a $4.3\times$ enhancement in cost-efficiency, and an $11.3\times$ improvement in energy-efficiency on average for production-scale RecSys preprocessing.

Paper Structure

This paper contains 26 sections, 2 equations, 17 figures, 2 tables, 2 algorithms.

Figures (17)

  • Figure 1: High-level overview of the end-to-end RecSys training pipeline. In this work, we assume our baseline data storage and ingestion pipeline for data preprocessing by referring to the related academic literature published by Meta \cite{dsi, scribe, tectonic_shift, recd}.
  • Figure 2: System architectures for RecSys training. (a) A system that co-locates CPU-based data preprocessing workers with GPU-based model training workers within the same server node. (b) A system that provisions a pool of disaggregated CPU servers for data preprocessing.
  • Figure 3: Effective preprocessing throughput (left axis) and the resulting GPU utilization (right axis) as a function of the number of CPU cores (i.e., number of preprocessing workers) utilized for preprocessing. The dotted line shows the upper bound, i.e., the maximum training throughput achievable using a single NVIDIA A100 GPU (left axis), which assumes the GPU is seamlessly fed with a sufficient amount of train-ready tensors without interruption. To measure the GPU's utilization, we use the CUDA Profiling Tools Interface (CUPTI) library. The results are collected on the evaluation platform detailed in Section \ref{sect:methodology} using our synthetic model RM5.
  • Figure 4: The number of CPU cores required for CPU-centric preprocessing to fully utilize a training node containing 8 A100 GPUs.
  • Figure 5: Latency to preprocess a single mini-batch input using a single preprocessing worker in the baseline CPU-centric system, broken into key steps of preprocessing. The "Extract" stage (Section \ref{sect:recsys_training_pipeline}) is further divided into (1) latency to fetch encoded raw feature data from the remote storage node (denoted as "Extract (Read)") and (2) latency spent decoding them (denoted as "Extract (Decode)"). As depicted, data preprocessing is bounded by the compute-intensive feature generation and normalization operations, rather than I/O operations ("Extract (Read)"). All results are normalized to RM1.
  • ...and 12 more figures
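
The relationship that Figures 3 and 4 describe, between preprocessing worker count, aggregate preprocessing throughput, and GPU utilization, can be sketched with simple arithmetic. The following is a minimal illustration rather than the authors' methodology: it assumes preprocessing throughput scales linearly with the number of CPU cores, and the function names and all numeric values are hypothetical placeholders, not measurements from the paper.

```python
import math


def cores_needed(gpu_demand: float, per_core_throughput: float) -> int:
    """Smallest number of CPU cores whose combined preprocessing throughput
    meets the GPU's demand (both in samples/s).

    Assumption: throughput scales linearly with core count.
    """
    if gpu_demand < 0 or per_core_throughput <= 0:
        raise ValueError("demand must be >= 0 and per-core throughput > 0")
    return math.ceil(gpu_demand / per_core_throughput)


def gpu_utilization(num_cores: int, per_core_throughput: float,
                    gpu_max_throughput: float) -> float:
    """Fraction of the GPU's maximum training throughput that the
    preprocessing workers can sustain, capped at 1.0."""
    supplied = num_cores * per_core_throughput
    return min(1.0, supplied / gpu_max_throughput)


# Hypothetical numbers, chosen only to exercise the arithmetic.
GPU_MAX = 1000.0   # samples/s one GPU could consume (placeholder)
PER_CORE = 30.0    # samples/s one preprocessing worker produces (placeholder)

print(cores_needed(GPU_MAX, PER_CORE))          # cores for one GPU
print(cores_needed(8 * GPU_MAX, PER_CORE))      # cores for an 8-GPU node
print(gpu_utilization(10, PER_CORE, GPU_MAX))   # utilization with 10 cores
```

Under the linear-scaling assumption, the required core count grows in proportion to the number of GPUs, which is why a multi-GPU training node can demand a large pool of preprocessing cores when each core is slow relative to the GPU.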