Modeling the Potential of Message-Free Communication via CXL.mem

Stepan Vanecek, Matthew Turner, Manisha Gajbe, Matthew Wolf, Martin Schulz

TL;DR

This work addresses the memory-wall challenge by predicting when CXL.mem-based, message-free data exchange can outperform traditional MPI messaging. It combines an extended Mitos-based memory-access analysis with a per-MPI-call performance model that also accounts for cross-node transfers, enabling precise, call-level optimizations. Validations on a 2D heat-transfer miniapp and HPCG demonstrate that either data-transfer or data-access overhead can dominate depending on workload and problem size, and multi-node predictions indicate substantial potential gains. The approach provides actionable guidance for developers to prioritize refactoring where CXL.mem offers the most benefit while highlighting design considerations for future memory-pooling architectures.

Abstract

Heterogeneous memory technologies are increasingly important instruments in addressing the memory wall in HPC systems. While most are deployed in single-node setups, CXL.mem allows memory to be attached to multiple nodes simultaneously, enabling shared memory pooling. This opens new possibilities, particularly for efficient inter-node communication. In this paper, we present a novel performance evaluation toolchain combined with an extended performance model for message-based communication, which can be used to predict potential performance benefits from using CXL.mem for data exchange. Our approach analyzes the data access patterns of MPI applications: it examines on-node accesses to and from MPI buffers as well as cross-node MPI traffic to gain a full understanding of the impact of memory performance. We combine this data in an extended performance model to predict which data transfers could benefit from direct CXL.mem implementations as compared to traditional MPI messages. Our model works at per-MPI-call granularity, allowing the identification and later optimization of the MPI invocations in the code with the highest potential for speedup from CXL.mem. For our toolchain, we extend the memory trace sampling tool Mitos and use it to extract data access behavior. In the post-processing step, the raw data is automatically analyzed to provide performance models for each individual MPI call. We validate the models on two sample applications -- a 2D heat transfer miniapp and the HPCG benchmark -- and show how they support targeted optimizations that integrate CXL.mem.

Paper Structure

This paper contains 39 sections, 10 equations, 10 figures.

Figures (10)

  • Figure 1: Workflow of Mitoshooks sampling MPI applications. The additions specifically for modeling (beyond Mitos performance analysis focus) are on the green background.
  • Figure 2: Illustrative examples of load parallelism under latency-limited and bandwidth-limited conditions.
  • Figure 3: Comparison of DDR and CXL performance across different data sources (cache hits, LFB, memory).
  • Figure 4: Data allocation and communication paths.
  • Figure 5: Comparison of model predictions and reference implementation for 2D stencil with N+S and W+E halo exchanges using shared memory.
  • ...and 5 more figures