PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

Jigao Luo, Nils Boeschen, Muhammad El-Hindi, Carsten Binnig

TL;DR

The paper tackles storage-resident, large-scale OLAP on GPUs by reframing tensor computation runtimes (TCRs) as a substrate for distributed query processing. It introduces PystachIO, a PyTorch-based engine that aggressively overlaps computation with high-bandwidth network and NVMe storage I/O through chunked, streaming, and multi-path techniques, including queue-based scheduling and adaptive regulation. Key contributions include a detailed analysis of naive I/O in TCRs, several optimizations (chunking, deferred synchronization, metadata caching, reader combining, dynamic buffering), and a comprehensive end-to-end evaluation showing up to 3x speedups and linear scalability to SF1000 on 2 GPUs. The work demonstrates that tensor runtimes can serve as effective substrates for scalable, out-of-memory OLAP in modern AI data centers, enabling efficient analytical workloads beyond traditional in-memory, single-node scenarios.

Abstract

The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable scalable computation and rapid access to storage-resident data. Tensor computation runtimes (TCRs), such as PyTorch, originally designed for AI workloads, have recently been shown to accelerate analytical workloads. However, prior work has primarily considered settings where the data fits in aggregated GPU memory. In this paper, we systematically study how TCRs can support scalable, distributed query processing for large-scale, storage-resident OLAP workloads. Although TCRs provide abstractions for network and storage I/O, naive use often underutilizes GPU and I/O bandwidth due to insufficient overlap between computation and data movement. As a core contribution, we present PystachIO, a PyTorch-based distributed OLAP engine that combines fast network and storage I/O with key optimizations to maximize GPU, network, and storage utilization. Our evaluation shows up to 3x end-to-end speedups over existing distributed GPU-based query processing approaches.
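
The overlap between computation and data movement that the abstract refers to can be illustrated with a minimal producer-consumer sketch. This is not PystachIO's implementation: it uses only the Python standard library, and `read_chunks`, `overlapped_sum`, `chunk_size`, and `depth` are hypothetical names chosen for illustration. A background thread stands in for the storage or network reader, and a bounded queue limits how many chunks are in flight while the main thread computes.

```python
import queue
import threading


def read_chunks(table, chunk_size):
    """Hypothetical stand-in for a storage reader: yields fixed-size chunks."""
    for start in range(0, len(table), chunk_size):
        yield table[start:start + chunk_size]


def overlapped_sum(table, chunk_size=4, depth=2):
    """Sum a column while a background thread prefetches upcoming chunks."""
    buf = queue.Queue(maxsize=depth)  # bounded: at most `depth` chunks buffered
    done = object()                   # sentinel marking the end of the input

    def producer():
        for chunk in read_chunks(table, chunk_size):
            buf.put(chunk)            # blocks when the consumer falls behind
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    total = 0
    while True:
        chunk = buf.get()
        if chunk is done:
            break
        total += sum(chunk)           # compute on one chunk while the next loads
    return total


result = overlapped_sum(list(range(10)))
```

A blocking variant would instead read the whole table first and compute afterwards, so the reader and the compute step would never be busy at the same time. The bounded queue is one simple way to cap memory use, under the assumption that a chunk fits comfortably in the available buffer space.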

Paper Structure

This paper contains 76 sections, 1 equation, 19 figures.

Figures (19)

  • Figure 1: TPC-H Q3 runtime at scale factors (SF) 500 and 1000 on two GPUs over SSD-resident, non-co-partitioned tables. A blocking PyTorch-style baseline executes sequentially, leading to long runtimes and out-of-memory errors. In contrast, PystachIO overlaps storage I/O, networking, and computation to reduce query time by over 3$\times$.
  • Figure 2: PystachIO: a database architecture that exploits fast network and storage hardware in AI data centers on top of the PyTorch TCR. Naive use of the TCR does not achieve optimal performance. PystachIO adds database-specific optimizations, including overlapping compute and I/O and parallel storage access, to enable highly efficient query processing.
  • Figure 3: TCRs provide fast networking and storage primitives: (a) Performance of PyTorch's AllToAll collective versus a naive distributed join. (b) SSD read performance with GDS enabled, measured during the I/O stage of a scan versus the end-to-end full scan.
  • Figure 4: Naive and optimized variants of a distributed hash join in PyTorch. The blocking variant executes all join phases sequentially, which is the most natural mapping to PyTorch but also the least efficient. Optimizations enable the overlap of phases (chunking and deferred synchronization) and can significantly improve runtime. For each variant, the expected GPU and network I/O utilization is shown in the top-right inset, and the scheduling order of operations (for the overlapped variants) is annotated in the bottom-right corner.
  • Figure 5: Distributed join performance on 2 GPU nodes for varying numbers of 8-byte columns (120M/320M build/probe tuples, hit ratio 0.5). Hatched bars show network shuffle time for the blocking variants.
  • ...and 14 more figures