Pooling Engram Conditional Memory in Large Language Models using CXL

Ruiyang Ma, Teng Ma, Zhiyuan Su, Hantian Zha, Xinpeng Zhao, Xuchun Shang, Xingrui Yi, Zheng Liu, Zhu Cao, An Wu, Zhichong Dou, Ziqian Liu, Daikang Kuang, Guojie Luo

TL;DR

This paper integrates a CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance and providing a scalable, cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.

Abstract

Engram conditional memory has emerged as a promising component for LLMs because it decouples static knowledge lookup from dynamic computation. Since Engram exhibits sparse access patterns and supports prefetching, its massive embedding tables are well suited to offloading to lower-tier memory. In this paper, we propose using a Compute Express Link (CXL) memory pool for Engram storage. Compared to RDMA, CXL provides the fine-grained, low-latency access required by Engram's small, discrete retrievals. We integrate the CXL-based Engram pool into SGLang, achieving near-DRAM end-to-end performance. This provides a scalable and cost-efficient storage solution for future Engram-integrated LLMs without compromising inference performance.

Paper Structure (20 sections, 2 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Brief architecture of Engram.
  • Figure 2: Overview of RDMA/CXL memory pools.
  • Figure 3: Latency for Engram-27B across varying batch size.
  • Figure 4: Overview of CXL-based Engram pooling system.
  • Figure 5: Latency for Engram-27B across varying batch size.
  • ...and 1 more figure