Load Balanced Parallel Node Generation for Meshless Numerical Methods
Jon Vehovar, Miha Rot, Matjaž Depolli, Gregor Kosec
TL;DR
This work tackles the challenge of generating nodes for meshless methods, in parallel, with quasi-uniform density in complex geometries. It introduces a parallel advancing-front algorithm that uses a d-dimensional hypertree for spatial indexing and a prebuilt work tree to balance the workload, minimizing locking through leaf-level separation and a restart mechanism. Compared with the prior Pfill approach on a disc domain, the new method yields notable throughput gains and scales up to 64 hardware threads, with efficiency decreasing at higher thread counts due to synchronization and cache effects. The authors also discuss adapting the method to distributed memory and identify avenues for improvement, including adaptive leaf splitting and diagnosing a performance bifurcation phenomenon. Overall, the approach offers a promising path towards scalable, load-balanced node generation for meshless numerical methods, with potential applicability to adaptive, distributed simulations.
Abstract
Meshless methods solve partial differential equations by approximating differential operators at a node as a weighted sum of values at its neighbours. One algorithm for generating nodes suitable for meshless numerical analysis is based on n-dimensional Poisson disc sampling. It handles complex geometries and supports variable node density, a crucial feature for adaptive analysis. We modify this method for parallel execution using two coupled hypertrees, one for spatial indexing and one for work distribution. The latter is prebuilt according to the node density function, so that each leaf represents a balanced unit of work. Threads advance separate fronts and claim work hypertree leaves as needed, avoiding leaves that neighbour those claimed by other threads. Node placement constraints, combined with the partially prebuilt spatial hypertree, eliminate the need to lock the tree while it is being modified. Thread collisions are handled by the work hypertree at the leaf level, drastically reducing the number of mutex acquisitions required for collision checks during point insertion. We explore the behaviour of the proposed algorithm, compare its performance with existing attempts at parallelisation, and consider the requirements for adapting it to distributed systems.
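The leaf-claiming rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a uniform 2D grid of leaves in place of the density-adapted work hypertree and a single lock guarding the claim table. The names `WorkGrid`, `try_claim` and `release` are hypothetical. A thread may claim a leaf only if neither that leaf nor any of its neighbours is held by a different thread, so two threads never insert nodes into adjacent regions at the same time.

```python
import threading


class WorkGrid:
    """Hypothetical stand-in for the work hypertree: a uniform 2D grid of leaves."""

    def __init__(self, nx, ny):
        self.nx, self.ny = nx, ny
        self.owner = {}               # (i, j) -> id of the thread holding the leaf
        self.lock = threading.Lock()  # guards the claim table only, not the nodes

    def _neighbourhood(self, leaf):
        """Yield the leaf itself and its existing neighbours (up to 9 cells)."""
        i, j = leaf
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < self.nx and 0 <= nj < self.ny:
                    yield (ni, nj)

    def try_claim(self, leaf, tid):
        """Claim `leaf` for thread `tid` unless it or a neighbour belongs to another thread."""
        with self.lock:
            for other in self._neighbourhood(leaf):
                if self.owner.get(other, tid) != tid:
                    return False      # collision: the caller should pick another leaf
            self.owner[leaf] = tid
            return True

    def release(self, leaf, tid):
        """Give up a leaf once the thread has finished placing nodes in it."""
        with self.lock:
            if self.owner.get(leaf) == tid:
                del self.owner[leaf]


grid = WorkGrid(6, 6)
grid.try_claim((2, 2), tid=0)   # succeeds: nothing is claimed yet
grid.try_claim((2, 3), tid=1)   # fails: neighbours a leaf held by thread 0
grid.try_claim((4, 4), tid=1)   # succeeds: far enough from thread 0
```

In this sketch the lock is taken once per leaf claim rather than once per inserted node, which is the sense in which leaf-level collision handling reduces the number of mutex acquisitions.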
