GetBatch: Distributed Multi-Object Retrieval for ML Data Loading

Alex Aizman; Abhishek Gaikwad; Piotr Żelasko

TL;DR

GetBatch is a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution.

Abstract

Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To address this, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x higher throughput for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.
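
To make the per-request overhead argument concrete, the following is a minimal back-of-the-envelope sketch, not a model from the paper. It assumes strictly serial requests, a fixed per-request overhead, and a fixed link bandwidth; the function names and all parameter values are hypothetical and chosen only for illustration.

```python
def individual_get_time(n_objects: int, object_bytes: int,
                        overhead_s: float, bandwidth_bps: float) -> float:
    """Serial cost of n independent GETs: every object pays the overhead."""
    return n_objects * (overhead_s + object_bytes / bandwidth_bps)


def batched_time(n_objects: int, object_bytes: int,
                 overhead_s: float, bandwidth_bps: float) -> float:
    """Cost of one batched request: the overhead is paid once (assumption)."""
    return overhead_s + n_objects * object_bytes / bandwidth_bps


# Hypothetical workload: 1000 objects of 10 KB, 1 ms overhead, 1 GB/s link.
t_individual = individual_get_time(1000, 10_000, 0.001, 1e9)
t_batched = batched_time(1000, 10_000, 0.001, 1e9)
print(f"individual: {t_individual:.3f} s, batched: {t_batched:.3f} s")
```

Under these assumptions the overhead term scales with the number of objects only in the individual-GET case, which is why small objects are the most affected. Real systems issue requests concurrently, so the actual gap depends on the workload and is what the paper's measurements quantify.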

Paper Structure (47 sections, 3 figures, 2 tables)

Introduction
Methods
AIStore
GetBatch Design and Execution Semantics
GetBatch Server-side Execution
Execution Flow
Phase 1: DT Registration.
Phase 2: Distributed Sender Activation.
Phase 3: Client Redirection and Ordered Assembly.
GetBatch Execution Options and Capabilities
Request-level Execution Options
Streaming (strm).
Continue-on-error (coer).
Colocation hints (coloc).
Fault Handling and Completion
...and 32 more sections

Figures (3)

Figure 1: AIStore sequential and batched random access data loading patterns. Sequential I/O (a) reads entire shards and selects samples from a buffer, while GetBatch (b) retrieves only the sampled items in a single request.
Figure 2: GetBatch execution model. A client submits a batch request to a proxy, which selects a Designated Target (DT). The proxy activates all other targets as senders. Senders stream locally owned data to the DT over peer-to-peer paths, and the DT emits a single output stream in strict request order.
Figure 3: Sustained throughput comparison between individual GET and GetBatch across object sizes and batch configurations. GetBatch yields the largest gains for small objects, where per-request overhead dominates.
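
The strict request ordering described in the Figure 2 caption can be illustrated with a small sketch. This is not the AIStore implementation: it only assumes that senders deliver objects to the Designated Target in arbitrary order and that the DT buffers early arrivals until every earlier entry of the request has been emitted. The function name `assemble_in_request_order` is hypothetical, and object names are assumed to be unique within a request.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def assemble_in_request_order(
    requested: List[str],
    arrivals: Iterable[Tuple[str, bytes]],
) -> Iterator[Tuple[str, bytes]]:
    """Emit (name, payload) pairs in the order given by `requested`.

    `arrivals` may deliver objects in any order; an object is held in a
    buffer until all objects requested before it have been emitted.
    """
    pending: Dict[str, bytes] = {}  # arrived but not yet emitted
    next_idx = 0                    # position of the next object to emit
    for name, payload in arrivals:
        pending[name] = payload
        # Drain every buffered object that is now next in request order.
        while next_idx < len(requested) and requested[next_idx] in pending:
            key = requested[next_idx]
            yield key, pending.pop(key)
            next_idx += 1


# Hypothetical example: three objects arriving out of order from senders.
request = ["a", "b", "c"]
out_of_order = [("c", b"3"), ("a", b"1"), ("b", b"2")]
ordered = list(assemble_in_request_order(request, out_of_order))
```

The sketch ignores missing objects and sender failures, which the paper covers under its continue-on-error option and fault handling sections.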
