PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage
Jigao Luo, Nils Boeschen, Muhammad El-Hindi, Carsten Binnig
TL;DR
The paper tackles storage-resident, large-scale OLAP on GPUs by reframing tensor computation runtimes (TCRs) as a substrate for distributed query processing. It introduces PystachIO, a PyTorch-based engine that aggressively overlaps computation with high-bandwidth network and NVMe storage I/O through chunked, streaming, and multi-path techniques, including queue-based scheduling and adaptive regulation. Key contributions include a detailed analysis of naive I/O in TCRs, several end-to-end optimizations (chunking, deferred synchronization, metadata caching, reader combining, dynamic buffering), and comprehensive end-to-end evaluation showing up to 3x speedups and linear scalability to SF1000 on 2 GPUs. The work demonstrates that tensor runtimes can serve as effective substrates for scalable, out-of-memory OLAP in modern AI data centers, enabling efficient analytical workloads beyond traditional in-memory single-node scenarios.
Abstract
The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable scalable computation and rapid access to storage-resident data. Tensor computation runtimes (TCRs), such as PyTorch, originally designed for AI workloads, have recently been shown to accelerate analytical workloads. However, prior work has primarily considered settings where the data fits in aggregated GPU memory. In this paper, we systematically study how TCRs can support scalable, distributed query processing for large-scale, storage-resident OLAP workloads. Although TCRs provide abstractions for network and storage I/O, naive use often underutilizes GPU and I/O bandwidth due to insufficient overlap between computation and data movement. As a core contribution, we present PystachIO, a PyTorch-based distributed OLAP engine that combines fast network and storage I/O with key optimizations to maximize GPU, network, and storage utilization. Our evaluation shows up to 3x end-to-end speedups over existing distributed GPU-based query processing approaches.
