A Mess of Memory System Benchmarking, Simulation and Application Profiling
Pouya Esmaili-Dokht, Francesco Sgherzi, Valeria Soldera Girelli, Isaac Boixaderas, Mariana Carmin, Alireza Monemi, Adria Armejach, Estanislao Mercadal, German Llort, Petar Radojkovic, Miquel Moreto, Judit Gimenez, Xavier Martorell, Eduard Ayguade, Jesus Labarta, Emanuele Confalonieri, Rishabh Dubey, Jason Adlard
TL;DR
The paper addresses the fragmented landscape of memory performance evaluation by introducing the Memory stress (Mess) framework, which unifies benchmarking, simulation, and application profiling around a family of bandwidth–latency curves. The Mess benchmark delivers close-to-hardware memory characterization across mixed read/write traffic and scales from unloaded to saturated memory systems, while the Mess simulator uses these curves in a feedback-control loop to achieve fast yet accurate memory modeling. Across actual systems and multiple simulators (ZSim, gem5, OpenPiton Metro-MPI), Mess demonstrates high fidelity (errors as low as $0.4\%$–$6\%$ in some cases) and practical adoption for emerging technologies such as CXL memory expanders. The framework is open-source and integrated with production HPC profiling tools, enabling researchers and developers to analyze, compare, and design memory systems with a unified, technology-agnostic approach. The work’s significance lies in accelerating memory-technology exploration and providing reliable, interpretable insights into memory bottlenecks for real-world HPC workloads.
Abstract
The Memory stress (Mess) framework provides a unified view of memory system benchmarking, simulation and application profiling. The Mess benchmark provides a holistic and detailed memory system characterization. It is based on hundreds of measurements that are represented as a family of bandwidth--latency curves. The benchmark extends the coverage of all previous tools and leads to new findings in the behavior of actual and simulated memory systems. We deploy the Mess benchmark to characterize Intel, AMD, IBM, Fujitsu, Amazon and NVIDIA servers with DDR4, DDR5, HBM2 and HBM2E memory. The Mess memory simulator uses the bandwidth--latency concept for memory performance simulation. We integrate Mess with widely used CPU simulators, enabling modeling of all high-end memory technologies. The Mess simulator is fast and easy to integrate, and it closely matches the actual system performance. By design, it enables quick adoption of new memory technologies in hardware simulators. Finally, Mess application profiling positions the application in the bandwidth--latency space of the target memory system. This information can be correlated with other application runtime activities and the source code, leading to a better overall understanding of the application's behavior. The current Mess benchmark release covers all major CPU and GPU ISAs: x86, ARM, Power, RISC-V, and NVIDIA's PTX. We also release as open source the ZSim, gem5 and OpenPiton Metro-MPI simulators integrated with the Mess simulator for DDR4, DDR5, Optane, HBM2, HBM2E and CXL memory expanders. Mess application profiling is already integrated into a suite of production HPC performance analysis tools.
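To make the bandwidth--latency idea concrete, the following is a minimal sketch of how a simulator might look up memory latency on a curve and settle on an operating point through a simple feedback loop. It is not the Mess implementation. The curve points are invented for illustration rather than measured, and the names `CURVES`, `latency_at` and `simulate` are hypothetical. The toy CPU model (a fixed number of outstanding 64-byte cache lines, with bandwidth derived from Little's law) and the proportional update with a fixed gain are likewise assumptions.

```python
from bisect import bisect_left

# Hypothetical curve family: read ratio -> (bandwidth GB/s, latency ns) points,
# sorted by bandwidth. The values are invented for illustration, not measured.
CURVES = {
    1.0: [(0.0, 90.0), (50.0, 100.0), (100.0, 140.0), (120.0, 300.0)],
    0.5: [(0.0, 90.0), (40.0, 105.0), (80.0, 160.0), (100.0, 350.0)],
}


def latency_at(curve, bandwidth):
    """Linearly interpolate latency at a bandwidth, clamping to the curve ends."""
    xs = [b for b, _ in curve]
    if bandwidth <= xs[0]:
        return curve[0][1]
    if bandwidth >= xs[-1]:
        return curve[-1][1]
    i = bisect_left(xs, bandwidth)
    (b0, l0), (b1, l1) = curve[i - 1], curve[i]
    return l0 + (l1 - l0) * (bandwidth - b0) / (b1 - b0)


def simulate(curve, outstanding_lines, steps=200, gain=0.2):
    """Iterate a damped feedback loop until bandwidth and latency agree with the curve.

    Assumed toy CPU model: a fixed number of outstanding 64-byte cache lines,
    so by Little's law bandwidth = outstanding_lines * 64 / latency
    (bytes per ns, which is numerically equal to GB/s).
    """
    latency = curve[0][1]  # start from the unloaded latency
    bandwidth = 0.0
    for _ in range(steps):
        bandwidth = outstanding_lines * 64 / latency
        target = latency_at(curve, bandwidth)
        latency += gain * (target - latency)  # move part of the way to the target
    return bandwidth, latency


if __name__ == "__main__":
    bw, lat = simulate(CURVES[1.0], outstanding_lines=100)
    print(f"operating point: {bw:.1f} GB/s at {lat:.1f} ns")
```

In this sketch the gain damps the update so that the loop settles on a point of the curve instead of oscillating between a low-latency and a high-latency estimate. A different read/write mix would simply select a different curve from the family.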
