Contiguous Graph Partitioning For Optimal Total Or Bottleneck Communication

Willow Ahrens

TL;DR

This work proposes the first near-linear time algorithms for several graph partitioning problems in the contiguous regime, and proposes a new bottleneck cost which reflects the sum of communication and computation on each part.

Abstract

Graph partitioning schedules parallel calculations like sparse matrix-vector multiply (SpMV). We consider contiguous partitions, where the $m$ rows (or columns) of a sparse matrix with $N$ nonzeros are split into $K$ parts without reordering. We propose the first near-linear time algorithms for several graph partitioning problems in the contiguous regime. Traditional objectives such as the simple edge cut, hyperedge cut, or hypergraph connectivity minimize the total cost of all parts under a balance constraint. Our total partitioners use $O(Km + N)$ space. They run in $O((Km\log(m) + N)\log(N))$ time, a significant improvement over prior $O(K(m^2 + N))$ time algorithms due to Kernighan and Grandjean et al. Bottleneck partitioning minimizes the maximum cost of any part. We propose a new bottleneck cost which reflects the sum of communication and computation on each part. Our bottleneck partitioners use linear space. The exact algorithm runs in linear time when $K^2$ is $O(N^C)$ for $C < 1$. Our $(1 + \epsilon)$-approximate algorithm runs in linear time when $K\log(c_{high}/(c_{low}\epsilon))$ is $O(N^C)$ for $C < 1$, where $c_{high}$ and $c_{low}$ are upper and lower bounds on the optimal cost. We also propose a simpler $(1 + \epsilon)$-approximate algorithm which runs within a factor of $\log(c_{high}/(c_{low}\epsilon))$ of linear time. We empirically demonstrate that our algorithms efficiently produce high-quality contiguous partitions on a test suite of 42 test matrices. When $K = 8$, our hypergraph connectivity partitioner achieved a speedup of $53\times$ (mean $15.1\times$) over prior algorithms. The mean runtime of our bottleneck partitioner was 5.15 SpMVs.
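
To make the contiguous bottleneck setting concrete, the following is a minimal sketch of the classical "probe plus binary search" approach to splitting rows into $K$ contiguous parts. It is not the algorithm proposed in this work: it assumes a simplified cost that counts only the nonzeros in each part (computation) and ignores communication entirely. The function names `probe` and `bottleneck_partition`, and the use of a CSR-style row pointer array `indptr`, are assumptions made for illustration.

```python
from bisect import bisect_right

def probe(indptr, K, limit):
    """Greedily pack rows into K contiguous parts of at most `limit` nonzeros.

    `indptr` is a CSR-style row pointer (length m + 1, nondecreasing).
    Returns K + 1 split points if the limit is feasible, else None.
    """
    m = len(indptr) - 1
    splits = [0]
    start = 0
    for _ in range(K):
        # Largest boundary j such that rows start..j-1 fit within the limit.
        j = bisect_right(indptr, indptr[start] + limit) - 1
        if j == start and start < m:
            return None  # a single row already exceeds the limit
        splits.append(j)
        start = j
    return splits if start == m else None

def bottleneck_partition(indptr, K):
    """Binary search for the smallest feasible per-part nonzero limit.

    Simplified cost (an assumption of this sketch): nonzeros per part only.
    """
    m = len(indptr) - 1
    lo = max((indptr[i + 1] - indptr[i] for i in range(m)), default=0)
    hi = indptr[m]  # putting every row in one part is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(indptr, K, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo, probe(indptr, K, lo)

# Five rows with 2, 1, 4, 1, 4 nonzeros, split into K = 3 contiguous parts.
cost, splits = bottleneck_partition([0, 2, 3, 7, 8, 12], 3)
print(cost, splits)
```

Each probe costs $O(K \log m)$ because the prefix sums in `indptr` allow a binary search for the next split point, and the outer search adds a logarithmic factor in the range of candidate costs. A cost that also accounts for communication, as the abstract describes, would need a different feasibility test than the prefix-sum lookup used here.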

Paper Structure

This paper contains 21 sections, 43 equations, 8 figures, 3 tables, 6 algorithms.

Figures (8)

  • Figure 1: Our running example matrix, together with an example symmetric partition of $x$ and $y$. Nonzeros are denoted with $*$.
  • Figure 2: Links of our example matrix $A$ are illustrated as line segments connecting elements of $A$ on left, and as points (with labeled multiplicities) on right. Links residing entirely within part $2$ are shown in bold. Part 2 contains two links starting at $i = 3$ and terminating at $i = 5$, and three links starting at $i = 5$ and terminating at $i=6$. In total, part 2 contains $1 + 2 + 1 + 1 + 3 = 8$ links, which is equal to the number of points dominated by our dotted region representing the partition split points.
  • Figure 3: The feasible regions of setup and cleanup phases under a weight limit of 12 nonzeros per part. The first (nontrivial) setup phase (irregular upper region) and cleanup phase (triangular lower region) are displayed in bold. The pairs $(\sigma_1, \sigma'_1), \ldots, (\sigma_8, \sigma'_8)$ are shown from the upper left to the lower right of the feasible region of the setup phase. Compare this figure to the strictly triangular phases of \cite{eppstein_sequence_1990}.
  • Figure 4: Performance profiles comparing normalized modeled quality of our general (possibly noncontiguous) partitioners (Table \ref{tbl:partitioners}) on symmetric and asymmetric test matrices (Table \ref{tbl:matrices}) in realistic and infinite reuse situations. Quality is measured with cost \ref{eq:nonsymmetriccost}, using the coefficients $c_{\textbf{entry}} = 1$, $c_{\textbf{row}} = 10$, and $c_{\textbf{message}} = 100$. For symmetric matrices, we require that the associated partitions be symmetric (we use the same partition for rows and columns). Our asymmetric test matrices also include their transposes. Some of the partitioners may reorder the matrix; setup time includes reordering operations.
  • Figure 5: Performance profiles comparing normalized modeled quality of our general (possibly noncontiguous) partitioners (Table \ref{tbl:partitioners}) on symmetric and asymmetric test matrices (Table \ref{tbl:matrices}) in realistic and infinite reuse situations. Quality is measured with cost \ref{eq:nonsymmetriccost}, using the coefficients $c_{\textbf{entry}} = 1$, $c_{\textbf{row}} = 10$, and $c_{\textbf{message}} = 100$. For symmetric matrices, we require that the associated partitions be symmetric (we use the same partition for rows and columns). Our asymmetric test matrices also include their transposes. Some of the partitioners may reorder the matrix; setup time includes reordering operations.
  • ...and 3 more figures