GetBatch: Distributed Multi-Object Retrieval for ML Data Loading

Alex Aizman, Abhishek Gaikwad, Piotr Żelasko

TL;DR

GetBatch is a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution.

Abstract

Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.

Paper Structure (47 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: AIStore sequential and batched random access data loading patterns. Sequential I/O (a) reads entire shards and selects samples from a buffer, while GetBatch (b) retrieves only the sampled items in a single request.
  • Figure 2: GetBatch execution model. A client submits a batch request to a proxy, which selects a Designated Target (DT). The proxy activates all other targets as senders. Senders stream locally owned data to the DT over peer-to-peer paths, and the DT emits a single output stream in strict request order.
  • Figure 3: Sustained throughput comparison between individual GET and GetBatch across object sizes and batch configurations. GetBatch yields the largest gains for small objects, where per-request overhead dominates.
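
The strict-order output described in the Figure 2 caption can be illustrated with a small, self-contained sketch. It is not the AIStore implementation: the function name `emit_in_request_order`, the `(index, payload)` arrival format, and the error raised for an incomplete batch are all assumptions made for illustration. It only shows how a Designated Target could buffer items that senders deliver out of order and still emit one stream in request order.

```python
from typing import Dict, Iterable, Iterator, Tuple


def emit_in_request_order(
    num_items: int, arrivals: Iterable[Tuple[int, bytes]]
) -> Iterator[Tuple[int, bytes]]:
    """Yield (request_index, payload) pairs in strict request order.

    `arrivals` yields (request_index, payload) in arbitrary order, as if
    several senders were streaming to one Designated Target (hypothetical
    format). An early arrival is buffered until every lower-indexed item
    has been emitted.
    """
    pending: Dict[int, bytes] = {}
    next_index = 0
    for index, payload in arrivals:
        pending[index] = payload
        # Drain the buffer for as long as the next expected item is present.
        while next_index in pending:
            yield next_index, pending.pop(next_index)
            next_index += 1
    # Assumption: an incomplete batch is reported as an error. The source
    # does not say how missing items are handled.
    if next_index != num_items:
        raise RuntimeError(f"batch incomplete: missing item {next_index}")


# Four requested items arrive out of order from different senders.
arrivals = [(2, b"C"), (0, b"A"), (3, b"D"), (1, b"B")]
ordered = list(emit_in_request_order(4, arrivals))
```

In this sketch the buffer holds only items that arrived ahead of their turn, so memory use depends on how far senders run ahead of the slowest one rather than on the batch size.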