GPolylla: Fully GPU-accelerated polygonal mesh generator
Sergio Salinas-Fernández, Nancy Hitschfeld-Kahler, Roberto Carrasco
TL;DR
This work tackles efficient polygonal mesh generation on GPUs by evolving Polylla into a fully GPU-accelerated tool, GPolylla, that converts an input triangulation $τ=(V,E)$ into a polygonal mesh $τ'=(V,E')$ via terminal-edge regions and the Longest-Edge Propagation Path ($Lepp$). It employs a half-edge (DCEL) representation on the GPU and performs in-place topology changes by updating next/prev pointers, avoiding dynamic memory allocation. The approach comprises a data-parallel label-traversal-repair pipeline across six CUDA kernels, with seed management and a tensor-core-accelerated scan to produce a final polygonal mesh entirely on the GPU. Empirical results show substantial speedups relative to CPU sequential implementations (up to ×83.2 including copy costs, ×746.8 excluding copy) and demonstrate scalability to large meshes, highlighting GPU data-transfer costs as a major factor and suggesting further optimizations using compact data layouts. The findings indicate that fully GPU-accelerated polygonal mesh generation is viable and practical for high-resolution simulations and VEM workflows.
Abstract
This work presents a fully GPU-accelerated algorithm for the polygonal mesh generator known as Polylla. Polylla is a tri-to-polygon mesh generator that relies on the half-edge data structure to manage polygons of arbitrary shape. The proposed parallel algorithm introduces a novel approach for converting triangulations into polygonal meshes using the half-edge data structure in parallel on the GPU. By changing the adjacency values of each half-edge, the algorithm unlinks the half-edges that are not used in the new polygonal mesh without freeing or allocating memory on the GPU. The experimental results show a speedup of up to $\times 83.2$ over the sequential CPU implementation. When the cost of copying the data structure from the host to the device and back is excluded, the speedup reaches $\times 746.8$.
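The in-place unlinking described above can be illustrated with a small sequential sketch. It is not the GPolylla implementation: the structure-of-arrays layout, the names `HalfEdgeMesh`, `make_two_triangles`, `unlink_edge`, and `face_vertices`, and the two-triangle input are all assumptions made for illustration, and the sketch ignores the synchronization a parallel kernel would need when neighboring edges are unlinked at the same time. It only shows that rewriting four next/prev entries merges two triangles into one polygon while every array keeps its original size.

```cpp
#include <cassert>
#include <vector>

// Hypothetical structure-of-arrays half-edge mesh (field names are illustrative).
struct HalfEdgeMesh {
    std::vector<int> origin; // origin vertex of each half-edge
    std::vector<int> twin;   // opposite half-edge, or -1 on the boundary
    std::vector<int> next;   // next half-edge around the same face
    std::vector<int> prev;   // previous half-edge around the same face
};

// Two triangles (0,1,2) and (0,2,3) sharing one interior edge.
// Half-edges 0..2 belong to the first triangle, 3..5 to the second;
// half-edge 2 (2->0) and half-edge 3 (0->2) are twins.
HalfEdgeMesh make_two_triangles() {
    HalfEdgeMesh m;
    m.origin = {0, 1, 2, 0, 2, 3};
    m.twin   = {-1, -1, 3, 2, -1, -1};
    m.next   = {1, 2, 0, 4, 5, 3};
    m.prev   = {2, 0, 1, 5, 3, 4};
    return m;
}

// Unlink interior half-edge e and its twin by rewiring their neighbors.
// Nothing is erased or allocated: e and its twin stay in the arrays
// but are no longer reachable through next/prev.
void unlink_edge(HalfEdgeMesh& m, int e) {
    const int t = m.twin[e];
    assert(t >= 0 && "only interior edges can be unlinked");
    const int e_prev = m.prev[e], e_next = m.next[e];
    const int t_prev = m.prev[t], t_next = m.next[t];
    m.next[e_prev] = t_next;  // skip e: continue into the twin's face
    m.prev[t_next] = e_prev;
    m.next[t_prev] = e_next;  // skip the twin: continue into e's face
    m.prev[e_next] = t_prev;
}

// Collect the vertices of the face containing `start` by following next.
std::vector<int> face_vertices(const HalfEdgeMesh& m, int start) {
    std::vector<int> verts;
    int h = start;
    do {
        verts.push_back(m.origin[h]);
        h = m.next[h];
    } while (h != start);
    return verts;
}
```

In this sketch, calling `unlink_edge(m, 2)` on the mesh returned by `make_two_triangles()` turns the two triangular faces into a single quadrilateral: walking `next` from half-edge 0 then visits four half-edges instead of three, while all four arrays still hold six entries.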
