Xorbits: Automating Operator Tiling for Distributed Data Science

Weizheng Lu; Kaisheng He; Xuye Qin; Chengjie Li; Zhong Wang; Tao Yuan; Xia Liao; Feng Zhang; Yueguo Chen; Xiaoyong Du

Xorbits: Automating Operator Tiling for Distributed Data Science

Weizheng Lu, Kaisheng He, Xuye Qin, Chengjie Li, Zhong Wang, Tao Yuan, Xia Liao, Feng Zhang, Yueguo Chen, Xiaoyong Du

TL;DR

Xorbits tackles the bottlenecks of scaling pandas/NumPy workloads on clusters by introducing dynamic tiling and a tri-level computation-graph framework (tileable, chunk, subtask) that can switch between graph construction and execution. The system preserves API compatibility, enabling drop-in use with pandas/NumPy while avoiding data-skew and OOM through metadata-driven tiling, graph fusion, scheduling, and an intermediate storage layer. Key contributions include a formalization of three computation graphs, a yield-based dynamic tiling mechanism, and optimizations such as graph fusion, auto rechunking, and memory-aware storage/backends, validated by benchmarks showing up to 2.66x speedups and 96.7% API coverage. The results demonstrate Xorbits’ ability to scale data science and array workloads to thousands of cores with strong efficiency and compatibility, offering a practical path for deploying large-scale DS pipelines without rewriting code.

Abstract

Data science pipelines commonly utilize dataframe and array operations for tasks such as data preprocessing, analysis, and machine learning. The most popular tools for these tasks are pandas and NumPy. However, these tools are limited to executing on a single node, making them unsuitable for processing large-scale data. Several systems have attempted to distribute data science applications to clusters while maintaining interfaces similar to single-node libraries, enabling data scientists to scale their workloads without significant effort. However, existing systems often struggle with processing large datasets due to Out-of-Memory (OOM) problems caused by poor data partitioning. To overcome these challenges, we develop Xorbits, a high-performance, scalable data science framework specifically designed to distribute data science workloads across clusters while retaining familiar APIs. The key differentiator of Xorbits is its ability to dynamically switch between graph construction and graph execution. Xorbits has been successfully deployed in production environments with up to 5k CPU cores. Its applications span various domains, including user behavior analysis and recommendation systems in the e-commerce sector, as well as credit assessment and risk management in the finance industry. Users can easily scale their data science workloads by simply changing the import line of their pandas and NumPy code. Our experiments demonstrate that Xorbits can effectively process very large datasets without encountering OOM or data-skewing problems. Over the fastest state-of-the-art solutions, Xorbits achieves an impressive 2.66* speedup on average. In terms of API coverage, Xorbits attains a compatibility rate of 96.7%, surpassing the fastest framework by an impressive margin of 60 percentage points. Xorbits is available at https://github.com/xorbitsai/xorbits.

Xorbits: Automating Operator Tiling for Distributed Data Science

TL;DR

Abstract

Xorbits: Automating Operator Tiling for Distributed Data Science

Authors

TL;DR

Abstract

Table of Contents

Figures (10)