Table of Contents
Fetching ...

Partitioning Trillion Edge Graphs on Edge Devices

Adil Chhabra, Florian Kurpicz, Christian Schulz, Dominik Schweisgut, Daniel Seemaier

TL;DR

This paper introduces StreamCPI, a novel framework that further reduces the memory overhead of streaming partitioners through run-length compression of block assignments and proposes a modification to the LA-vector bit vector for append support, which can be used for online run-length compression in other streaming applications.

Abstract

Processing large-scale graphs, containing billions of entities, is critical across fields like bioinformatics, high-performance computing, navigation and route planning, among others. Efficient graph partitioning, which divides a graph into sub-graphs while minimizing inter-block edges, is essential to graph processing, as it optimizes parallel computing and enhances data locality. Traditional in-memory partitioners, such as METIS and KaHIP, offer high-quality partitions but are often infeasible for enormous graphs due to their substantial memory overhead. Streaming partitioners reduce memory usage to O(n), where 'n' is the number of nodes of the graph, by loading nodes sequentially and assigning them to blocks on-the-fly. This paper introduces StreamCPI, a novel framework that further reduces the memory overhead of streaming partitioners through run-length compression of block assignments. Notably, StreamCPI enables the partitioning of trillion-edge graphs on edge devices. Additionally, within this framework, we propose a modification to the LA-vector bit vector for append support, which can be used for online run-length compression in other streaming applications. Empirical results show that StreamCPI reduces memory usage while maintaining or improving partition quality. For instance, using StreamCPI, the Fennel partitioner effectively partitions a graph with 17 billion nodes and 1.03 trillion edges on a Raspberry Pi, achieving significantly better solution quality than Hashing, the only other feasible algorithm on edge devices. StreamCPI thus advances graph processing by enabling high-quality partitioning on low-cost machines.

Partitioning Trillion Edge Graphs on Edge Devices

TL;DR

This paper introduces StreamCPI, a novel framework that further reduces the memory overhead of streaming partitioners through run-length compression of block assignments and proposes a modification to the LA-vector bit vector for append support, which can be used for online run-length compression in other streaming applications.

Abstract

Processing large-scale graphs, containing billions of entities, is critical across fields like bioinformatics, high-performance computing, navigation and route planning, among others. Efficient graph partitioning, which divides a graph into sub-graphs while minimizing inter-block edges, is essential to graph processing, as it optimizes parallel computing and enhances data locality. Traditional in-memory partitioners, such as METIS and KaHIP, offer high-quality partitions but are often infeasible for enormous graphs due to their substantial memory overhead. Streaming partitioners reduce memory usage to O(n), where 'n' is the number of nodes of the graph, by loading nodes sequentially and assigning them to blocks on-the-fly. This paper introduces StreamCPI, a novel framework that further reduces the memory overhead of streaming partitioners through run-length compression of block assignments. Notably, StreamCPI enables the partitioning of trillion-edge graphs on edge devices. Additionally, within this framework, we propose a modification to the LA-vector bit vector for append support, which can be used for online run-length compression in other streaming applications. Empirical results show that StreamCPI reduces memory usage while maintaining or improving partition quality. For instance, using StreamCPI, the Fennel partitioner effectively partitions a graph with 17 billion nodes and 1.03 trillion edges on a Raspberry Pi, achieving significantly better solution quality than Hashing, the only other feasible algorithm on edge devices. StreamCPI thus advances graph processing by enabling high-quality partitioning on low-cost machines.

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Schematic of streaming partitioning of a node into orange (circle) and blue (square) blocks. It shows the information held by the partitioner at any instance of streaming: (1) the neighborhood of the current node (node 5), and (2) a container of all nodes' block assignments. Block assignments are used to compute a partitioning score, and node 5 is assigned to the block with the higher score. Run-length encoding of block assignments is shown along with its array representation.
  • Figure 2: Run-length compression of a sequence $S$ using a bit vector $B$ and an array of run heads $A_{\text{head}}$. The sequence and bit vector have size $n$. The bit vector contains 1-bit if a new run starts at the corresponding position in $S$, or 0-bit otherwise. In practice, the bit vector is itself compressed. Computing $rank_1(i)$ of an index $i$ returns the number of 1's in the bit vector that occur before index $i$ in $B$. Here, $i=6$ and $\rho_6 = rank_1(6) - 1$ returns the position of the run in $A_{\text{head}}$ that $i=6$ belongs to, i.e, $A_{\text{head}}[\rho_6]$ = $A_{\text{head}}[2]$ = 3 = $S[6]$.
  • Figure 3: Comparison of $\textsc{StreamCpi}_{\beta=n,\kappa=1}$ with $\textsc{StreamCpi}_{\beta=10k,\kappa=1}$ and $\textsc{StreamCpi}_{\beta=100k,\kappa=1}$, i.e., batch-wiseStreamCpi, with a batch size of 10 000 and 100 000, on the Tuning Set using performance profiles.
  • Figure 4: Comparison of $\textsc{StreamCpi}_{\beta=n,\kappa=1}$ with $\textsc{StreamCpi}_{\beta=n,\kappa=5}$, $\textsc{StreamCpi}_{\beta=n,\kappa=10}$, $\textsc{StreamCpi}_{\beta=n,\kappa=15}$, $\textsc{StreamCpi}_{\beta=n,\kappa=20}$, i.e., $\kappa$-modifiedStreamCpi, with $\kappa$ = 5, 10, 15, and 20, on the Tuning Set using performance profiles.
  • Figure 5: Comparison of default StreamCpi, with batch-wise and $\kappa$-modifiedStreamCpi, Fennel-Array and Fennel-ExtPQ on the Test Set using performance profiles.