VQhull: a Fast Planar Quickhull

Thomas Koopman, Jordy Aaldering, Bernard van Gastel, Sven-Bodo Scholz

TL;DR

VQhull is a vectorized, parallel implementation of Quickhull for planar convex hulls that minimizes data movement and exploits the available memory bandwidth. Using a vectorized in-place subset extraction and a two-phase parallelization (a parallel step followed by a cleanup step), it runs 1.6–16× faster sequentially and 1.5–11× faster in parallel than the state of the art. It reaches 85–100% of peak bandwidth on non-NUMA systems, and its energy usage in parallel can even be lower than in sequential runs. The work benchmarks three platforms and three PBBS datasets and discusses branch behavior, vectorization, memory-subsystem effects, and energy consumption. The findings show that runtime and energy can decouple in bandwidth-bound geometric algorithms, and they point to future directions such as heuristics that reduce memory bandwidth and NUMA-aware parallel strategies.

Abstract

Finding the convex hull is a fundamental problem in computational geometry. Quickhull is a fast algorithm for finding convex hulls. In this paper, we present VQhull, a fast parallel implementation of Quickhull that exploits vector instructions and coordinates CPU cores in a way that minimizes data movement. This implementation obtains a sequential runtime improvement of 1.6–16× and a parallel runtime improvement of 1.5–11× compared to the state of the art on the Problem Based Benchmark Suite (PBBS). VQhull achieves 85–100% of peak bandwidth on non-NUMA architectures, and 66–78% on our two-CPU NUMA system. This leaves little room for further improvements. A 4× speedup on 8 cores corresponds to a parallel efficiency of 50%. This suggests a waste of energy, but our measurements show a more complicated picture: energy usage may even be lower in parallel. Quickhull serves as a case study showing that runtime and energy consumption do not go hand in hand.
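To make the algorithm concrete, the following is a minimal scalar sketch of planar Quickhull with an in-place subset extraction. It is not the VQhull implementation: it uses no vector instructions or threads, and the names `cross`, `extract_left`, `hull_side`, and `quickhull` are hypothetical, chosen for this illustration only. It assumes points are `(x, y)` tuples and that collinear boundary points are not reported as hull vertices.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(p: Point, q: Point, r: Point) -> float:
    """Twice the signed area of triangle p, q, r.

    Positive when r lies strictly to the left of the directed line p -> q.
    """
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def extract_left(pts: List[Point], lo: int, hi: int, p: Point, q: Point) -> int:
    """Move the points of pts[lo:hi] strictly left of p -> q to the front.

    Works in place by swapping, so no point is lost.
    Returns the index one past the last extracted point.
    """
    write = lo
    for read in range(lo, hi):
        if cross(p, q, pts[read]) > 0:
            pts[write], pts[read] = pts[read], pts[write]
            write += 1
    return write


def hull_side(pts: List[Point], lo: int, hi: int, p: Point, q: Point) -> List[Point]:
    """Hull vertices strictly between p and q, given that pts[lo:hi] lie left of p -> q."""
    if lo == hi:
        return []
    # The point farthest from the line p -> q is guaranteed to be on the hull.
    far = max(range(lo, hi), key=lambda i: cross(p, q, pts[i]))
    r = pts[far]
    # Points inside triangle p, r, q are discarded; only the two outer subsets remain.
    mid1 = extract_left(pts, lo, hi, p, r)
    mid2 = extract_left(pts, mid1, hi, r, q)
    return hull_side(pts, lo, mid1, p, r) + [r] + hull_side(pts, mid1, mid2, r, q)


def quickhull(points: List[Point]) -> List[Point]:
    """Convex hull in clockwise order, starting at the lexicographically smallest point."""
    pts = list(points)  # work on a copy; the partitioning reorders it
    if not pts:
        return []
    a, b = min(pts), max(pts)
    if a == b:
        return [a]
    upper = extract_left(pts, 0, len(pts), a, b)
    lower = extract_left(pts, upper, len(pts), b, a)
    return [a] + hull_side(pts, 0, upper, a, b) + [b] + hull_side(pts, upper, lower, b, a)
```

For example, `quickhull([(0, 0), (1, 1), (2, 0), (2, 2), (0, 2), (1, 0.5)])` keeps the four corners of the square and drops the two interior points. Each recursion level reads every remaining point once, which is why the extraction step dominates the data movement that the abstract refers to.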

Paper Structure

This paper contains 28 sections, 9 equations, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: An example of the first partitioning step of the Quickhull algorithm.
  • Figure 2: The vcompresspd instruction.
  • Figure 3: Invariant for extracting the subsets $S_1$ and $S_2$ from a set of points $P$.
  • Figure 4: Parallel partition step for $3$ threads $t_0$, $t_1$, $t_2$.
  • Figure 5: The Kuzmin dataset of $10^8$ points. The first partition only moves $r_1$. The second partition eliminates all remaining points.
  • ...and 4 more figures
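
As a rough illustration of the compress operation named in Figure 2, the sketch below emulates in scalar Python what a masked compress does: the elements of a block whose mask bit is set are packed contiguously, in their original order. The helper names `compress` and `extract_with_compress` and the `width` parameter are assumptions made for this example; the paper's actual extraction scheme, including the invariant of Figure 3, is not reproduced here.

```python
from typing import Callable, List, Sequence


def compress(block: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Pack the elements of block whose mask entry is True, preserving order."""
    return [v for v, keep in zip(block, mask) if keep]


def extract_with_compress(xs: List[float], keep: Callable[[float], bool], width: int = 8) -> int:
    """Left-pack the elements of xs that satisfy keep, one block of `width` at a time.

    Returns the number of kept elements; they end up in xs[:count].
    Elements that are not kept may be overwritten (filter semantics, not a swap).
    """
    write = 0
    for start in range(0, len(xs), width):
        block = xs[start:start + width]      # load one "vector" worth of values
        mask = [keep(v) for v in block]      # one comparison per lane
        packed = compress(block, mask)       # emulated compress
        # write <= start always holds, so unread input is never overwritten
        xs[write:write + len(packed)] = packed
        write += len(packed)
    return write
```

With `xs = [5, -1, 3, -2, 8, -7, 4, 0, 9, -3]` and `keep = lambda v: v > 0`, calling `extract_with_compress(xs, keep, width=4)` packs the positive values into the front of `xs`. The loop body contains no data-dependent branch on individual elements, which is the property that makes this pattern attractive for vector hardware.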