Handling of Memory Page Faults during Virtual-Address RDMA
Antonis Psistakis
TL;DR
This work tackles the challenge of page faults during user‑level, zero‑copy RDMA by introducing a hardware–software page fault handling mechanism integrated with the ARM SMMU and the ExaNeSt PLDMA. It combines a fault‑handling framework (fault library and SMMU driver), a firmware‑level, fault‑aware PLDMA design, and a software workflow (Netlink/get_user_pages) to trigger fault resolution and retransmission. The approach aims to avoid memory pinning while preserving RDMA performance and memory utilization, and it is evaluated on ExaNeSt Quad‑FPGA hardware against pinning and pre‑faulting strategies. The results indicate latency benefits for on‑demand paging using a Touch‑Ahead strategy (get_user_pages) over touching pages one by one, with strong potential for improved memory efficiency in large, multi‑node clusters.
Abstract
In modern high-speed interconnection networks, avoiding system calls during cluster communication (e.g., in data centers and High Performance Computing) has become a necessity because of the high overhead of multiple data copies between kernel and user space. User-level zero-copy Remote Direct Memory Access (RDMA) technologies address this problem, improving performance and reducing system energy consumption. However, traditional RDMA engines cannot tolerate page faults and therefore use various techniques to avoid them. State-of-the-art RDMA approaches typically rely on pinning address spaces or multiple pages per application. Pinning has long-term disadvantages: increased programming complexity (pinning and unpinning buffers), limits on how much memory can be pinned, and inefficient memory utilization. In addition, pinning does not fully prevent page faults, because modern operating systems apply internal optimization mechanisms such as Transparent Huge Pages (THP), which are enabled by default in Linux. This thesis implements a page-fault handling mechanism integrated with the DMA engine of the ExaNeSt project. Faults are detected by the ARM System Memory Management Unit (SMMU) and resolved through a hardware-software solution that can request retransmission when needed. The mechanism required modifications to the Linux SMMU driver, the development of a new software library, changes to the DMA engine hardware, and adjustments to the DMA scheduling logic. Experiments were conducted on the Quad-FPGA Daughter Board (QFDB) of ExaNeSt, which uses Xilinx Zynq UltraScale+ MPSoCs. Finally, we evaluate our mechanism, compare it against alternatives such as pinning and pre-faulting, and discuss the advantages of our approach.
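The fault, resolve, and retransmit cycle described above can be illustrated with a toy model. The sketch below is not the thesis implementation: the function `transfer`, its parameters, and the idea of modeling memory as a set of resident page numbers are all assumptions made purely for illustration. It counts how many faults and retransmissions a single transfer incurs when the handler resolves one page per fault versus several pages per fault, and when every page is already resident (as with pinning or pre-faulting).

```python
def transfer(num_pages, resident, touch_ahead=1):
    """Toy model of one transfer over pages 0..num_pages-1.

    resident    -- set of page numbers that are already mapped (updated in place)
    touch_ahead -- hypothetical number of pages the handler maps per fault
    Returns (faults, retransmissions).
    """
    if touch_ahead < 1:
        raise ValueError("touch_ahead must be at least 1")

    faults = 0
    retransmissions = 0
    page = 0
    while page < num_pages:
        if page in resident:
            page += 1  # page is mapped, so the transfer proceeds
            continue
        # The transfer stops at an unmapped page: one fault is reported.
        faults += 1
        # The handler maps the faulting page and up to touch_ahead - 1
        # of the pages that follow it.
        last = min(page + touch_ahead, num_pages)
        resident.update(range(page, last))
        # The transfer is retried from the faulting page.
        retransmissions += 1
    return faults, retransmissions


if __name__ == "__main__":
    pages = 16
    print("one page per fault:  ", transfer(pages, set(), touch_ahead=1))
    print("four pages per fault:", transfer(pages, set(), touch_ahead=4))
    print("all pages resident:  ", transfer(pages, set(range(pages))))
```

In this model, resolving more pages per fault reduces the number of fault and retransmission round trips, while a fully resident buffer incurs none. The model deliberately ignores real costs such as handler latency and the memory held by pages that are mapped but never used.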
