Efficient Tabular Data Preprocessing of ML Pipelines

Yu Zhu, Wenqi Jiang, Gustavo Alonso

TL;DR

This work tackles the preprocessing bottleneck in ML pipelines caused by the CPU–GPU performance gap. It introduces Piper, a network-attached FPGA accelerator that implements a column-wise, streaming dataflow with specialized PEs and a parallel UTF-8 decoding unit to accelerate stateful tabular preprocessing, including embedding generation. Across production DLRMs from Meta and Google, Piper delivers speedups of $4.7\sim71.3\times$ over a 128-core CPU server and $4.8\sim20.3\times$ over a data-center GPU when handling binary inputs, with network-attached configurations offering the best end-to-end performance. The approach reduces resource and energy consumption while enabling scalable deployments by decoupling preprocessing from training and supporting streaming data processing. Overall, Piper demonstrates a practical path to more efficient end-to-end ML training in data centers by offloading expensive tabular preprocessing to specialized hardware.

Abstract

Data preprocessing pipelines, which include data decoding, cleaning, and transforming, are a crucial component of Machine Learning (ML) training. They are computationally intensive and often become a major bottleneck due to the increasing performance gap between the CPUs used for preprocessing and the GPUs used for model training. Recent studies show that a significant number of CPUs across several machines are required to achieve sufficient throughput to saturate the GPUs, leading to increased resource and energy consumption. When the pipeline involves vocabulary generation, the preprocessing performance scales poorly due to significant row-wise synchronization overhead between different CPU cores and servers. To address this limitation, in this paper we present the design of Piper, a hardware accelerator for tabular data preprocessing, prototype it on FPGAs, and demonstrate its potential for training pipelines of commercial recommender systems. Piper achieves a $4.7\sim71.3\times$ speedup in latency over a 128-core CPU server and outperforms a data-center GPU by $4.8\sim20.3\times$ when using binary input. This performance showcases Piper's potential to increase the efficiency of data preprocessing pipelines and significantly reduce their resource consumption.
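
To make the vocabulary-generation step concrete, the following is a minimal software sketch of column-wise preprocessing for tab-separated rows. It is not Piper's hardware design or the authors' code. The function names (parse_row, build_vocab, apply_vocab), the hex encoding of sparse features, and the choice to fill missing values with 0 are all assumptions made for illustration. The point it shows is that each sparse column's vocabulary depends only on that column, so a column can be processed as an independent stream without sharing a dictionary across workers.

```python
from typing import Dict, List, Tuple


def parse_row(line: str, num_dense: int) -> Tuple[int, List[int], List[int]]:
    """Split one tab-separated row into (label, dense features, sparse features).

    Assumptions for this sketch: the label comes first, followed by
    `num_dense` integer fields, followed by hex-encoded hash fields.
    Empty fields are treated as missing and filled with 0.
    """
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    dense = [int(f) if f else 0 for f in fields[1:1 + num_dense]]
    sparse = [int(f, 16) if f else 0 for f in fields[1 + num_dense:]]
    return label, dense, sparse


def build_vocab(column: List[int]) -> Dict[int, int]:
    """Vocabulary generation: assign each distinct value an index
    in order of first appearance within this one column."""
    vocab: Dict[int, int] = {}
    for value in column:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def apply_vocab(column: List[int], vocab: Dict[int, int]) -> List[int]:
    """Vocabulary mapping: replace each value with its index."""
    return [vocab[value] for value in column]


# Three hypothetical rows: 1 label, 2 dense features, 2 sparse features.
rows = [
    "1\t5\t\t1f\ta0",
    "0\t2\t7\ta0\t1f",
    "1\t\t3\t1f\tbb",
]
parsed = [parse_row(r, num_dense=2) for r in rows]

# Transpose the sparse features so each column is handled on its own.
sparse_columns = [list(col) for col in zip(*(p[2] for p in parsed))]
encoded = [apply_vocab(col, build_vocab(col)) for col in sparse_columns]
```

In this sketch the two sparse columns never touch each other's dictionaries. A row-wise split of the same data across workers would instead require every worker to agree on one shared vocabulary per column, which is the synchronization cost the abstract refers to.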

Paper Structure

This paper contains 30 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Preprocessing vs training (one epoch, different batch sizes on one GPU).
  • Figure 2: System overview for DLRM data preprocessing pipeline on CPUs and Piper, respectively. Piper supports both PCIe and the network as the data movement interface. The white blocks represent parallel workers.
  • Figure 3: Dataflow of preprocessing pipelines in CPU.
  • Figure 4: An example of data preprocessing for a row of raw UTF-8 data, in which orange represents the labels, blue denotes tabs, green indicates dense features, and yellow denotes sparse features (8-byte hash values).
  • Figure 5: Piper accelerator overview. The dataflow involves two consecutive loops ① & ②. We use the same color of blocks as in Figure 3 to represent different types of operators.
  • ...and 5 more figures