
Load Balanced Parallel Node Generation for Meshless Numerical Methods

Jon Vehovar, Miha Rot, Matjaž Depolli, Gregor Kosec

TL;DR

This work tackles the parallel generation of node sets for meshless methods: nodes that are locally quasi-uniform, follow a prescribed (possibly variable) density, and fill complex geometries. It introduces a parallel advancing-front algorithm that uses a d-dimensional hypertree for spatial indexing and a prebuilt work tree to balance the workload. Locking is minimized through leaf-level separation of threads and a restart mechanism. Compared with the earlier Pfill approach on a disc domain, the new method achieves higher point throughput and scales up to 64 hardware threads, although efficiency decreases at higher thread counts due to synchronization and cache effects. The authors also discuss adapting the method to distributed memory and identify avenues for improvement, including adaptive leaf splitting and diagnosing an observed performance bifurcation. Overall, the approach offers a promising path toward scalable, load-balanced node generation for meshless numerical methods, with potential applications in adaptive and distributed simulations.
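The leaf-level separation between threads can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation. The work leaves are assumed to form a uniform nx-by-ny grid instead of a density-adapted hypertree, "neighbouring" is taken to mean the 8 surrounding cells, and the restart mechanism is not modelled. The class LeafClaims and its methods are hypothetical names. A thread takes one lock per leaf claim, not per inserted point, and is refused any leaf that touches a leaf held by another thread.

```python
import threading


class LeafClaims:
    """Hypothetical claim registry for work leaves laid out on a uniform grid.

    Assumption: a uniform nx-by-ny grid stands in for the work hypertree's leaves.
    A thread may claim a leaf only if no leaf in its 3x3 neighbourhood is held by
    a different thread.
    """

    def __init__(self, nx, ny):
        self.nx, self.ny = nx, ny
        self.owner = {}               # (i, j) -> id of the thread currently filling the leaf
        self.done = set()             # leaves that have already been filled
        self.lock = threading.Lock()  # acquired once per claim/release, not per point

    def _neighbourhood(self, i, j):
        # Yields the leaf itself and its up to 8 in-range neighbours.
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                a, b = i + di, j + dj
                if 0 <= a < self.nx and 0 <= b < self.ny:
                    yield a, b

    def try_claim(self, leaf, tid):
        """Return True and record ownership if thread `tid` may fill `leaf`."""
        i, j = leaf
        with self.lock:
            if leaf in self.done or leaf in self.owner:
                return False
            for nb in self._neighbourhood(i, j):
                holder = self.owner.get(nb)
                if holder is not None and holder != tid:
                    return False      # would border another thread's active leaf
            self.owner[leaf] = tid
            return True

    def release(self, leaf, tid):
        """Mark a leaf owned by `tid` as finished and free it."""
        with self.lock:
            if self.owner.get(leaf) == tid:
                del self.owner[leaf]
                self.done.add(leaf)


# Usage: two threads on a 4x4 grid of leaves.
claims = LeafClaims(4, 4)
print(claims.try_claim((0, 0), tid=1))  # True
print(claims.try_claim((1, 1), tid=2))  # False: touches leaf (0, 0) held by thread 1
print(claims.try_claim((2, 2), tid=2))  # True: no neighbour held by another thread
```

Because a claim is refused while an adjacent leaf is held by another thread, two threads never insert points into adjacent leaves at the same time. Assuming leaves are at least as wide as the local node spacing, per-point collision checks can then run without taking a mutex.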

Abstract

Meshless methods are used to solve partial differential equations by approximating differential operators at a node as a weighted sum of values at its neighbours. One of the algorithms for generating nodes suitable for meshless numerical analysis is an n-dimensional method based on Poisson disc sampling. It can handle complex geometries and supports variable node density, a crucial feature for adaptive analysis. We modify this method for parallel execution using coupled spatial indexing and work distribution hypertrees. The latter is prebuilt according to the node density function, ensuring that each leaf represents a balanced work unit. Threads advance separate fronts and claim work hypertree leaves as needed while avoiding leaves neighbouring those claimed by other threads. Node placement constraints and the partially prebuilt spatial hypertree are combined to eliminate the need to lock the tree while it is being modified. Collisions between threads are handled by the work hypertree at the leaf level, which drastically reduces the number of mutex acquisitions needed for collision checks during point insertion. We explore the behaviour of the proposed algorithm, compare its performance with existing attempts at parallelisation, and consider the requirements for adapting the developed algorithm to distributed systems.
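To make the underlying sequential method concrete, the following Python sketch fills the unit disc (the benchmark test domain) by advancing-front Poisson disc sampling. Each node on the front proposes candidates on a circle of radius h(p) around itself. A candidate is kept if it lies inside the domain and no existing node is closer than that radius. This is a simplified illustration, not the authors' implementation. It is two-dimensional and sequential, omits boundary discretisation, and uses a uniform background grid with cell size h_min in place of the spatial hypertree. The function fill_disc and the parameter n_candidates are hypothetical names.

```python
import math
import random
from collections import deque


def fill_disc(h, h_min, n_candidates=12, seed=0):
    """Sequential advancing-front Poisson disc fill of the unit disc (sketch).

    h(x, y) is the target node spacing; h_min must be a lower bound on h and is
    used as the cell size of a uniform background grid (the spatial index here).
    """
    rng = random.Random(seed)
    cell = h_min
    grid = {}    # (ix, iy) -> indices of nodes in that grid cell
    nodes = []   # accepted node coordinates

    def key(p):
        return (math.floor(p[0] / cell), math.floor(p[1] / cell))

    def insert(p):
        grid.setdefault(key(p), []).append(len(nodes))
        nodes.append(p)

    def too_close(c, r):
        # Scan every grid cell that can contain a node within distance r of c.
        k = math.ceil(r / cell)
        ix, iy = key(c)
        for a in range(ix - k, ix + k + 1):
            for b in range(iy - k, iy + k + 1):
                for idx in grid.get((a, b), ()):
                    # Small tolerance so the parent node (exactly r away) does not reject c.
                    if math.dist(c, nodes[idx]) < r * (1 - 1e-9):
                        return True
        return False

    insert((0.0, 0.0))   # single seed node at the disc centre
    front = deque([0])   # nodes whose surroundings have not been expanded yet
    while front:
        p = nodes[front.popleft()]
        r = h(*p)
        phase = rng.uniform(0.0, 2.0 * math.pi)  # random rotation of the candidate ring
        for m in range(n_candidates):
            t = phase + 2.0 * math.pi * m / n_candidates
            c = (p[0] + r * math.cos(t), p[1] + r * math.sin(t))
            if c[0] ** 2 + c[1] ** 2 > 1.0:      # outside the unit disc
                continue
            if too_close(c, r):
                continue
            insert(c)
            front.append(len(nodes) - 1)
    return nodes


nodes = fill_disc(lambda x, y: 0.08, h_min=0.08)
print(len(nodes))
```

For variable density, pass a spacing function together with its lower bound, for example fill_disc(lambda x, y: 0.02 + 0.06 * math.hypot(x, y), h_min=0.02). A smaller h_min gives a finer grid, so each neighbour search visits more cells.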

Paper Structure

This paper contains 11 sections and 4 figures.

Figures (4)

  • Figure 1: An example of the test domain being filled by 7 threads with $h = 0.01$ and the leaf size limit set to 100, near the end of the algorithm's first stage. Points of the same colour were placed by the same thread. Threads are at various stages of filling: some are idle, some are backfilling previously created seams, and the gaps between the regions of others are still being formed.
  • Figure 2: Parameter sweep for working leaf size calibration. Different colors correspond to different numbers of CPU cores used, and the markers denote different problem sizes.
  • Figure 3: Strong scaling tests for the presented algorithm and Pfill. The top panel shows the total point throughput, and the bottom panel shows the throughput per thread. Markers connected by dashed lines are averages over the markers with the same color and number of threads.
  • Figure 4: Average fractions of the total computation time that threads spend actively computing, for different numbers of threads and two problem sizes, for the benchmarks shown in Figure 3. The black horizontal dashed line marks the maximum possible thread activity, and the carets are offset for legibility where data from different problem sizes fall at similar coordinates.
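The leaf size limit that appears in Figures 1 and 2 controls how finely the work hypertree is prebuilt. The following minimal 2D sketch of such a prebuild rests on two assumptions. First, the expected number of nodes in a cell is taken to be proportional to the integral of 1/h² over the part of the cell inside the domain, estimated here with a midpoint rule. Second, a cell is split into four children whenever that estimate exceeds the limit. Cell, expected_nodes, build_work_tree and leaves are hypothetical names, and the paper's actual estimator and splitting criteria may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    x0: float                                      # lower-left corner
    y0: float
    size: float                                    # side length of the square cell
    children: list = field(default_factory=list)   # empty for a leaf


def expected_nodes(x0, y0, size, h, inside, samples=8):
    """Assumed estimator: integral of 1/h^2 over the cell's part inside the domain,
    using the midpoint rule on a samples-by-samples sub-grid."""
    step = size / samples
    total = 0.0
    for i in range(samples):
        for j in range(samples):
            x = x0 + (i + 0.5) * step
            y = y0 + (j + 0.5) * step
            if inside(x, y):
                total += step * step / h(x, y) ** 2
    return total


def build_work_tree(x0, y0, size, h, inside, leaf_limit, max_depth=12, depth=0):
    """Split a cell into quadrants while its estimated node count exceeds leaf_limit."""
    cell = Cell(x0, y0, size)
    if depth < max_depth and expected_nodes(x0, y0, size, h, inside) > leaf_limit:
        half = size / 2
        for dx in (0.0, half):
            for dy in (0.0, half):
                cell.children.append(build_work_tree(
                    x0 + dx, y0 + dy, half, h, inside, leaf_limit, max_depth, depth + 1))
    return cell


def leaves(cell):
    """Yield the leaves of the work tree (the units threads would claim)."""
    if not cell.children:
        yield cell
    else:
        for child in cell.children:
            yield from leaves(child)


# Usage: unit disc, constant spacing, leaf size limit of 100 expected nodes.
root = build_work_tree(-1.0, -1.0, 2.0, lambda x, y: 0.05,
                       lambda x, y: x * x + y * y <= 1.0, leaf_limit=100)
print(len(list(leaves(root))))
```

Leaves cut by the domain boundary have fewer expected nodes than interior leaves, so this simple rule bounds the work per leaf from above but not from below.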