ZipFlow: a Compiler-based Framework to Unleash Compressed Data Movement for Modern GPUs

Gwangoo Yeo, Zhiyang Shen, Wei Cui, Matteo Interlandi, Rathijit Sen, Bailu Ding, Qi Chen, Minsoo Rhu

TL;DR

This work tackles the end-to-end data movement bottleneck in GPU-accelerated analytics that arises when a dataset exceeds GPU memory and must be transferred over PCIe. It introduces ZipFlow, a compiler-based framework that abstracts compression algorithms into three parallel compute patterns and applies device-geometry scheduling, together with nesting, fusion, and pipelining, to optimize end-to-end performance. On the TPC-H benchmark, ZipFlow achieves average end-to-end speedups of about 2.08x over nvCOMP and 3.14x over CPU baselines, driven by improved compression ratios, faster decompression, and reduced I/O latency. The framework performs robustly across heterogeneous GPUs and shows that holistic optimization of data movement makes compression a more powerful lever for GPU-accelerated analytics.

Abstract

In GPU-accelerated data analytics, limited PCIe bandwidth makes data transfer from CPU to GPU a performance bottleneck once the data scales beyond GPU memory capacity. Data compression has come to the rescue by reducing the amount of data transferred while taking advantage of the GPU's computational power for decompression. To optimize end-to-end query performance, however, the workflow of data compression, transfer, and decompression must be designed holistically, based on the compression strategies and hardware characteristics, to balance I/O latency against computational overhead. In this work, we present ZipFlow, a compiler-based framework for optimizing compressed data transfer in GPU-accelerated data analytics. ZipFlow classifies compression algorithms into three distinct patterns based on their inherent parallelism. For each pattern, ZipFlow employs generalized scheduling strategies to effectively exploit the computational power of GPUs across diverse architectures. Building on these patterns, ZipFlow delivers flexible, high-performance, and holistic optimization, which substantially advances end-to-end data transfer capabilities. We evaluate the effectiveness of ZipFlow on the industry-standard TPC-H benchmark. Overall, ZipFlow achieves an average improvement of 2.08 times over the state-of-the-art GPU compression library (nvCOMP) and a 3.14 times speedup over CPU-based query processing engines (e.g., DuckDB).
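
The balance between I/O latency and decompression overhead described in the abstract can be illustrated with a back-of-the-envelope latency model. The sketch below is not ZipFlow's cost model or scheduler. The function names, the idealized pipelining assumption (transfer and decompression overlap perfectly, ignoring fill and drain time), and every number in the usage example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def uncompressed_latency(size_gb: float, pcie_gbps: float) -> float:
    """Seconds to move raw data over the link (size in GB, bandwidth in GB/s)."""
    return size_gb / pcie_gbps


def compressed_latency(
    size_gb: float,
    pcie_gbps: float,
    ratio: float,
    decomp_gbps: float,
    pipelined: bool,
) -> float:
    """Seconds to move compressed data and decompress it on the GPU.

    ratio       -- compression ratio (raw size / compressed size), assumed > 0
    decomp_gbps -- decompression throughput, measured on decompressed output
    pipelined   -- if True, assume transfer and decompression overlap perfectly
    """
    transfer = (size_gb / ratio) / pcie_gbps   # fewer bytes cross the link
    decompress = size_gb / decomp_gbps         # extra GPU work to restore the data
    if pipelined:
        # Idealized overlap: the slower stage determines the latency.
        return max(transfer, decompress)
    # No overlap: decompression starts only after the transfer finishes.
    return transfer + decompress


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the paper.
    size_gb, pcie_gbps, ratio, decomp_gbps = 100.0, 25.0, 4.0, 100.0
    print(uncompressed_latency(size_gb, pcie_gbps))
    print(compressed_latency(size_gb, pcie_gbps, ratio, decomp_gbps, pipelined=False))
    print(compressed_latency(size_gb, pcie_gbps, ratio, decomp_gbps, pipelined=True))
```

Under these assumed numbers, compression pays off only because decompression throughput is well above the link bandwidth. If `decomp_gbps` dropped below `pcie_gbps`, the pipelined latency would exceed the uncompressed one, which is the kind of trade-off a holistic design has to weigh.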

Paper Structure (27 sections, 2 equations, 22 figures, 3 tables)

This paper contains 27 sections, 2 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Latency breakdown of 22 TPC-H queries (SF=100) on CPU (AMD EPYC 7V12), A100 (PCIe-4), and H100 (PCIe-5). TQP is used for query processing on GPU.
  • Figure 2: Compression algorithms of different families.
  • Figure 3: End-to-end execution pipeline from stored data to ZipFlow.
  • Figure 4: ZipFlow Architecture Overview.
  • Figure 5: Logical dependencies for ZipFlow parallel patterns.
  • ...and 17 more figures