A fast MPI-based Distributed Hash-Table as Surrogate Model demonstrated in a coupled reactive transport HPC simulation
Max Lübke, Marco De Lucia, Steffen Christgau, Stefan Petri, Bettina Schnor
TL;DR
This work addresses the need for fast, scalable surrogate caching in HPC by proposing a fully distributed MPI-based DHT to serve as a surrogate cache within POET, a coupled reactive transport simulator. It presents three MPI-DHT variants (coarse-grained, fine-grained, and lock-free) and compares them against a server-based DAOS store, showing that the lock-free design delivers the best read/write throughput and substantial POET runtime reductions (up to 42%). Synthetic benchmarks demonstrate up to ~16 million reads/s and ~15 million writes/s for the lock-free DHT, while the coarse-grained and fine-grained variants suffer from locking-induced bottlenecks. The results establish a practical, high-performance cache mechanism for HPC simulations, with open-source implementations and clear guidance on synchronization strategies for distributed in-memory key-value storage.
Abstract
Surrogate models can play a pivotal role in enhancing performance in contemporary High-Performance Computing applications. Cache-based surrogates use already calculated simulation results to interpolate or extrapolate further simulation output values. But this approach only pays off if retrieving the needed values is much faster than running the actual simulation. While most existing key-value stores use a client-server architecture with dedicated storage nodes, this is not the most suitable architecture for HPC applications. Instead, we propose a distributed architecture where the parallel processes offer a part of their available memory to build a shared distributed hash table based on MPI. This paper presents three DHT approaches designed with the special requirements of HPC applications in mind. The presented lock-free design outperforms both DHT versions that use explicit synchronization by coarse-grained or fine-grained locking. The lock-free DHT shows very good scaling regarding read and write performance. The runtime of a coupled reactive transport simulation was reduced by between 14% and 42% using the lock-free DHT as a surrogate model.
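The idea of each process contributing a slice of its memory to one shared table can be illustrated with a minimal, single-process sketch. It is not the paper's implementation: the names `locate`, `make_table`, `write`, and `read`, the SHA-256 hash, and the overwrite-on-collision policy are all assumptions made for illustration. Plain Python lists stand in for the per-rank memory regions; in an actual MPI program these accesses would presumably be remote memory operations, which the sketch does not attempt to model.

```python
import hashlib
from typing import Any, List, Optional, Tuple


def locate(key: bytes, num_ranks: int, buckets_per_rank: int) -> Tuple[int, int]:
    """Map a key to (owner rank, bucket index) in a table split across ranks."""
    # Hypothetical hash choice: any well-mixing hash function would do.
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "little")
    rank = h % num_ranks                         # which process holds the entry
    bucket = (h // num_ranks) % buckets_per_rank  # slot inside that process's slice
    return rank, bucket


def make_table(num_ranks: int, buckets_per_rank: int) -> List[List[Optional[tuple]]]:
    """One list per rank stands in for the memory that rank contributes."""
    return [[None] * buckets_per_rank for _ in range(num_ranks)]


def write(table: List[List[Optional[tuple]]], key: bytes, value: Any) -> None:
    rank, bucket = locate(key, len(table), len(table[0]))
    # Assumed cache semantics: a colliding entry is simply overwritten.
    table[rank][bucket] = (key, value)


def read(table: List[List[Optional[tuple]]], key: bytes) -> Optional[Any]:
    rank, bucket = locate(key, len(table), len(table[0]))
    entry = table[rank][bucket]
    # A miss (empty slot or a different key) means the caller must run
    # the real simulation step and may then store its result.
    if entry is None or entry[0] != key:
        return None
    return entry[1]
```

A surrogate cache built this way would call `read` before each expensive computation and fall back to the simulation only on a miss, which is why lookup latency has to be far below the cost of the simulation itself.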
