EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

Weigao Su, Vishal Shrivastav

TL;DR

EDM addresses the latency bottleneck of Ethernet-based memory disaggregation by moving the remote-memory protocol into the Ethernet PHY and introducing a centralized in-network scheduler in the switch. The two core ideas, an in-PHY memory stack and a PHY-level scheduler inspired by Parallel Iterative Matching (PIM), eliminate MAC-layer overhead and switch queueing delay for memory traffic, achieving $\sim$300 ns unloaded latency and an average latency within $1.3\times$ of that at high network loads. Hardware experiments and large-scale simulations show substantial latency and throughput improvements over RDMA-based fabrics such as RoCEv2 and latency comparable to PCIe-based CXL, highlighting EDM's potential for scalable, low-cost rack- or cluster-scale memory disaggregation over Ethernet.
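
To make the MAC-layer overhead concrete, the following sketch estimates how much link time a small memory message consumes when it is carried in a standard Ethernet frame. The framing constants are standard IEEE 802.3 values; the 8-byte and 64-byte message sizes and the function names are illustrative assumptions, not figures or code from the paper. Headers added by protocols above the MAC layer (for example IP, UDP, or RoCEv2) are ignored, so real overhead would be higher.

```python
# Standard IEEE 802.3 framing constants, in bytes.
PREAMBLE_SFD = 8       # 7-byte preamble + 1-byte start-of-frame delimiter
HEADER = 14            # destination MAC, source MAC, EtherType
FCS = 4                # frame check sequence
MIN_FRAME = 64         # minimum frame size (header + payload + FCS)
INTER_PACKET_GAP = 12  # minimum idle gap between consecutive frames


def wire_bytes(payload: int) -> int:
    """Bytes of link time consumed by one frame carrying `payload` bytes."""
    # Frames shorter than the minimum are padded up to MIN_FRAME.
    frame = max(HEADER + payload + FCS, MIN_FRAME)
    return PREAMBLE_SFD + frame + INTER_PACKET_GAP


def efficiency(payload: int) -> float:
    """Fraction of consumed link time that carries useful payload."""
    return payload / wire_bytes(payload)


# Hypothetical message sizes chosen for illustration.
for size in (8, 64):
    print(f"{size}B message: {wire_bytes(size)}B on the wire, "
          f"{efficiency(size):.1%} useful")
```

Under these assumptions, an 8-byte message still occupies 84 bytes of link time because of padding, preamble, and the inter-packet gap, which is the kind of small-message penalty the in-PHY design aims to avoid.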

Abstract

Achieving low remote memory access latency remains the primary challenge in realizing memory disaggregation over Ethernet within datacenters. We present EDM, which attempts to overcome this challenge using two key ideas. First, while existing network protocols for remote memory access over Ethernet, such as TCP/IP and RDMA, are implemented on top of the MAC layer, EDM takes a radical approach by implementing the entire network protocol stack for remote memory access within the Physical layer (PHY) of Ethernet. This overcomes fundamental latency and bandwidth overheads imposed by the MAC layer, especially for small memory messages. Second, EDM implements a centralized, fast, in-network scheduler for memory traffic within the PHY of the Ethernet switch. Inspired by the classic Parallel Iterative Matching (PIM) algorithm, the scheduler dynamically reserves bandwidth between compute and memory nodes by creating virtual circuits in the PHY, thus eliminating queuing delay and layer 2 packet processing delay at the switch for memory traffic, while maintaining high bandwidth utilization. Our FPGA testbed demonstrates that EDM's network fabric incurs a latency of only $\sim$300 ns for remote memory access in an unloaded network, which is an order of magnitude lower than state-of-the-art Ethernet-based solutions such as RoCEv2 and comparable to emerging PCIe-based solutions such as CXL. Larger-scale network simulations indicate that even at high network loads, EDM's average latency remains within 1.3$\times$ its unloaded latency.
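
For readers unfamiliar with Parallel Iterative Matching, the sketch below shows the classic request, grant, accept rounds in software. It illustrates the textbook algorithm only: the abstract says EDM's scheduler is inspired by PIM and runs in the switch PHY, so this is not EDM's hardware design. The function name `pim_match`, its parameters, and the example demand are hypothetical.

```python
import random


def pim_match(requests, iterations=3, rng=None):
    """Return a {source: destination} matching computed by classic PIM.

    requests: dict mapping each source port to the set of destination
    ports it has pending traffic for.
    """
    rng = rng or random.Random()
    match = {}           # source -> destination chosen so far
    matched_dst = set()  # destinations already taken

    for _ in range(iterations):
        # Request: each unmatched source asks every free destination it wants.
        incoming = {}
        for src, dsts in requests.items():
            if src in match:
                continue
            for dst in dsts:
                if dst not in matched_dst:
                    incoming.setdefault(dst, []).append(src)
        if not incoming:
            break  # nothing left to match

        # Grant: each free destination grants one requester at random.
        grants = {}
        for dst, srcs in incoming.items():
            grants.setdefault(rng.choice(srcs), []).append(dst)

        # Accept: each source that received grants accepts one at random.
        for src, dsts in grants.items():
            dst = rng.choice(dsts)
            match[src] = dst
            matched_dst.add(dst)
    return match


# Hypothetical demand: compute nodes 0-2 requesting memory nodes 0-3.
demand = {0: {0, 1}, 1: {0}, 2: {2, 3}}
print(pim_match(demand, rng=random.Random(0)))
```

Each round with outstanding requests adds at least one source-destination pair, and no destination is ever assigned twice, so the result is always a conflict-free set of pairs. Those pairs are what a circuit-style scheduler could reserve bandwidth for.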

Paper Structure

This paper contains 40 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Memory disaggregation over Ethernet.
  • Figure 2: Data path for memory traffic in EDM vs. in existing Ethernet fabrics for memory disaggregation.
  • Figure 3: EDM network stack.
  • Figure 4: Testbed setup.
  • Figure 5: Breakdown of latency for EDM's network fabric for 64B read and write. TD+PD = transmission+propagation delay. A clock cycle is 2.56 ns. For details on the cycle numbers, refer to the paper's design sections on the host and switch network stacks.
  • ...and 3 more figures
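
The Figure 5 caption gives the clock period used in the latency breakdown, so cycle counts and nanoseconds can be converted with simple arithmetic. The following is a minimal sketch with illustrative helper names; the 10-cycle stage is a hypothetical value, not a number from the paper. Note that, per the caption, the breakdown also includes transmission and propagation delay, so the $\sim$300 ns total does not consist of pipeline cycles alone.

```python
CLOCK_CYCLE_NS = 2.56  # clock period stated in the Figure 5 caption


def cycles_to_ns(cycles: int) -> float:
    """Convert a cycle count to nanoseconds."""
    return cycles * CLOCK_CYCLE_NS


def ns_to_cycles(ns: float) -> float:
    """Convert nanoseconds to a (possibly fractional) cycle count."""
    return ns / CLOCK_CYCLE_NS


# The ~300 ns unloaded fabric latency expressed in clock cycles.
print(f"300 ns is about {ns_to_cycles(300):.0f} cycles")
# A hypothetical 10-cycle pipeline stage, for illustration only.
print(f"10 cycles = {cycles_to_ns(10):.1f} ns")
```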