Offloading to CXL-based Computational Memory

Suyeon Lee, Kangkyu Park, Kwangsik Shin, Ada Gavrilovska

TL;DR

The paper tackles data-movement bottlenecks in disaggregated memory systems built on CXL-based Computational Memory (CCM) by proposing a novel asynchronous back-streaming protocol. It introduces KAI, a system that continuously overlaps data movement and computation between the host and the CCM, reducing end-to-end runtime by up to 50.4% and substantially cutting both CCM and host idle time. By analyzing dual CCM architectures, workloads, and inefficiencies, the work shows how an end-to-end pipeline design can outperform the traditional remote polling (RP) and bulk synchronous (BS) offloading models. The findings suggest that asynchronous, device-initiated streaming with lightweight host polling and out-of-order (OoO) capabilities can significantly improve general-purpose CCM performance across diverse workloads.

Abstract

CXL-based Computational Memory (CCM) enables near-memory processing within expanded remote memory, presenting opportunities to address the data movement costs of disaggregated memory systems and to accelerate overall performance. However, existing operation offloading mechanisms cannot leverage the trade-offs among offloading models built on different CXL protocols. This work first examines these trade-offs and demonstrates their impact on end-to-end performance and system efficiency for workloads with diverse data and processing requirements. We propose a novel 'Asynchronous Back-Streaming' protocol by carefully layering data and control transfer operations on top of the underlying CXL protocols. We design KAI, a system that realizes the asynchronous back-streaming model and supports asynchronous data movement and lightweight pipelining in host-CCM interactions. Overall, KAI reduces end-to-end runtime by up to 50.4%, and CCM and host idle times by an average of 22.11x and 3.85x, respectively.
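
To illustrate why overlapping data movement with computation can shorten end-to-end runtime, the following is a minimal toy timing model, not KAI's implementation. It assumes the work splits into equal chunks that pass through three stages (CCM compute, transfer back to the host, host compute), each with a fixed per-chunk cost. The function names and the example costs are hypothetical and chosen only for illustration.

```python
def bulk_synchronous(n_chunks, ccm, xfer, host):
    """Toy model: each phase finishes for all chunks before the next begins."""
    return n_chunks * ccm + n_chunks * xfer + n_chunks * host


def pipelined(n_chunks, ccm, xfer, host):
    """Toy model: a chunk enters the next stage as soon as that stage is free."""
    ccm_done = xfer_done = host_done = 0
    for _ in range(n_chunks):
        # The CCM processes chunks back to back.
        ccm_done = ccm_done + ccm
        # The link sends a result once it exists and the link is idle.
        xfer_done = max(xfer_done, ccm_done) + xfer
        # The host consumes a result once it has arrived and the host is idle.
        host_done = max(host_done, xfer_done) + host
    return host_done


if __name__ == "__main__":
    # Hypothetical per-chunk costs in arbitrary time units.
    n, ccm, xfer, host = 4, 2, 1, 3
    print("bulk synchronous:", bulk_synchronous(n, ccm, xfer, host))
    print("pipelined:       ", pipelined(n, ccm, xfer, host))
```

In this model the pipelined total approaches the cost of the slowest stage times the number of chunks, plus the time to fill the pipeline, so the benefit depends on how evenly work is balanced across the CCM, the link, and the host.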

Paper Structure

This paper contains 18 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Simplified view of existing CCM partial offloading mechanisms (a, b) and the mechanism proposed in this work (c). Dotted lines represent ACKs/responses for the corresponding memory requests, omitted in (c) as they are unnecessary under our fully asynchronous interaction.
  • Figure 2: Block diagram of a real CCM device prototype. The device appears as an endpoint that supports the CXL protocols and memory expansion. It integrates both FPGA-based hardwired PFLs and a single general-purpose core.
  • Figure 3: Kernels of the attention block in LLM inference, exhibiting different characteristics under the OPT-2.7B model with a token size of 1K.
  • Figure 4: KNN execution with various workload configurations on real hardware, showing stacked runtime ratios of CCM (purple) and host tasks (green).
  • Figure 5: Execution of KNNs ($D_{dim}$, $R_{numRows}$) and graph analytics on M$^2$NDP, using remote polling (RP) and bulk synchronous flow (BS) as offloading mechanisms. Normalized runtime ratios are shown as stacked bars for CCM tasks (purple), data movement (yellow), and host tasks (green).
  • ...and 9 more figures