tf.data service: A Case for Disaggregating ML Input Data Processing

Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, Chandramohan A. Thekkath

TL;DR

This work introduces tf.data service, a disaggregated input data processing system that runs data preprocessing on remote CPU/RAM resources separate from ML accelerators. By decoupling data preparation from model training, it enables horizontal scale-out, ephemeral data sharing, and coordinated reads, reducing input bottlenecks and straggler effects while preserving model accuracy. Empirical results across vision and NLP workloads show substantial gains, including an average 31.7x speedup and 26.2x cost savings, with additional improvements from sharing preprocessed data and coordinating reads. The paper argues for viewing ML data processing as a multi-tenant service and outlines design, deployment, and hardware implications to guide future research and practice.

Abstract

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. The host CPU and RAM needed per accelerator core to process input data without causing data stalls varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to under-utilizing either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32x and cost by 26x, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
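
To make the straggler effect behind coordinated reads concrete, the following toy simulation may help. It is a sketch under stated assumptions, not the tf.data service implementation: it assumes synchronous training in which every step waits for its slowest client, and a cost model in which a batch is padded to its longest sequence. The helper names (`make_batches`, `batch_cost`, `total_time`) and the sequence-length range are hypothetical. The same length-bucketed batches are handed to clients either in arbitrary order or so that all clients in a step receive batches of similar size.

```python
import random

def make_batches(lengths, batch_size):
    """Bucket by length: sort, then cut into batches of similar-length sequences."""
    ordered = sorted(lengths)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered) - batch_size + 1, batch_size)]

def batch_cost(batch):
    """Assumed cost model: every sequence is padded to the longest one in the batch."""
    return max(batch) * len(batch)

def total_time(batches, num_clients):
    """Synchronous training: each step takes as long as its slowest client."""
    total = 0
    for i in range(0, len(batches) - num_clients + 1, num_clients):
        step = batches[i:i + num_clients]  # one batch per client
        total += max(batch_cost(b) for b in step)
    return total

rng = random.Random(0)
lengths = [rng.randint(8, 512) for _ in range(4096)]  # hypothetical sequence lengths
batches = make_batches(lengths, batch_size=32)

# Uncoordinated: clients receive batches in arbitrary order, so a single step
# can mix a short batch with a long one and everyone waits for the long one.
shuffled = batches[:]
rng.shuffle(shuffled)
uncoordinated_time = total_time(shuffled, num_clients=8)

# Coordinated: batches served in the same step come from neighbouring length
# buckets, so all clients finish the step at about the same time.
coordinated_time = total_time(batches, num_clients=8)

print("uncoordinated:", uncoordinated_time)
print("coordinated:  ", coordinated_time)
```

Under this cost model the coordinated schedule can never be slower than the shuffled one, because grouping batches of similar cost minimizes the sum of per-step maxima. The size of the gap depends entirely on the assumed length distribution and says nothing about the speedups measured in the paper.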

Paper Structure (50 sections, 1 equation, 12 figures)

Figures (12)

  • Figure 1: CDFs of normalized ML host resource usage for over 73k colocated processing jobs running in Google datacenters over a 24 hour period. The takeaway is that host resource requirements vary widely across ML jobs.
  • Figure 2: RetinaNet CPU/MEM usage when training on COCO using a GCP TPU v2-8 VM.
  • Figure 3: Architecture and workflow. Solid lines represent the data path, dashed lines represent the control path, and dotted lines the execution flow.
  • Figure 4: tf.data service API example.
  • Figure 5: Ephemeral data sharing workers serve requests from different jobs via sliding window caches.
  • ...and 7 more figures