Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search

Seoyoung Ko, Hyunjeong Shim, Wanju Doh, Sungmin Yun, Jinin So, Yongsuk Kwon, Sang-Soo Park, Si-Dong Roh, Minyong Yoon, Taeksang Song, Jung Ho Ahn

TL;DR

The paper targets billion-scale approximate nearest neighbor search (ANNS) for retrieval-augmented generation (RAG), where conventional DRAM/SSD solutions struggle with capacity and latency and RDMA clusters incur network overhead. It proposes Cosmos, a full in-memory ANNS system that offloads search to compute-capable CXL devices, eliminating host intervention and PCIe traffic. Cosmos combines rank-level parallel distance computation, which exploits DRAM rank parallelism, with adjacency-aware data placement, which balances search load across devices. Evaluations on SIFT1B and DEEP1B show up to 6.72× higher throughput than a baseline CXL system and 2.35× higher than a state-of-the-art CXL-based solution, demonstrating scalable, low-latency ANNS suitable for billion-scale RAG pipelines.

Abstract

Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
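The abstract says only that data placement balances search load "based on inter-cluster proximity"; it does not spell out the algorithm. As a rough illustration of the idea, the sketch below assumes a greedy heuristic in which clusters whose centroids are close (and so are likely to be probed by the same query) are steered onto different devices, with ties broken by device load. The function `place_clusters`, its parameters, and the heuristic itself are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Set


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def place_clusters(
    centroids: List[Sequence[float]],
    num_devices: int,
    num_neighbors: int = 2,
) -> Dict[int, int]:
    """Greedy placement sketch (an assumption, not the paper's algorithm).

    Each cluster goes to the device that currently holds the fewest of its
    adjacent clusters; remaining ties are broken by load, then device id.
    """
    n = len(centroids)

    # Adjacency: the num_neighbors closest centroids of each cluster.
    neighbors: List[Set[int]] = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(centroids[i], centroids[j]),
        )
        neighbors.append(set(others[:num_neighbors]))

    assignment: Dict[int, int] = {}
    load = [0] * num_devices

    for i in range(n):
        def cost(device: int):
            # Count adjacent clusters already placed on this device.
            conflicts = sum(
                1
                for j, dev in assignment.items()
                if dev == device and (j in neighbors[i] or i in neighbors[j])
            )
            return (conflicts, load[device], device)

        best = min(range(num_devices), key=cost)
        assignment[i] = best
        load[best] += 1

    return assignment


# Two tight pairs of clusters on a line, spread over two devices.
example = place_clusters([(0,), (1,), (10,), (11,)], num_devices=2, num_neighbors=1)
print(example)
```

In this toy input the two clusters of each tight pair land on different devices, so a query that probes both clusters of a pair would touch both devices rather than queueing on one.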

Paper Structure

This paper contains 5 sections, 3 figures.

Figures (3)

  • Figure 1: Overview of retrieval-augmented generation (RAG) and approximate nearest neighbor search (ANNS).
  • Figure 2: (a) Memory latency hierarchy highlighting the potential of CXL-attached memory as a new tier between DRAM and RDMA/SSD in terms of latency and capacity. (b) Latency breakdown of graph-based ANN search on large-scale datasets (SIFT and DEEP with 100M vectors).
  • Figure 3: (a) Overview of the system architecture. (b) CXL controller architecture featuring a general-purpose core for executing graph-based ANN search via a memory-mapped host interface. (c) Rank-level distance calculation logic that enables parallel L2 and inner product calculations across memory ranks.
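The caption of Figure 3(c) mentions parallel L2 and inner product calculations across memory ranks. A minimal software sketch of that idea follows, assuming each vector's dimensions are split into contiguous slices, one per rank, so that every rank computes a partial result and the partials are then summed. The slicing layout, the function names, and the use of squared L2 (without the final square root) are assumptions made for illustration; the page does not describe the actual hardware data layout.

```python
from typing import List, Sequence


def split_across_ranks(vec: Sequence[float], num_ranks: int) -> List[List[float]]:
    """Split a vector into contiguous per-rank slices (assumed layout)."""
    chunk = (len(vec) + num_ranks - 1) // num_ranks  # ceiling division
    return [list(vec[r * chunk:(r + 1) * chunk]) for r in range(num_ranks)]


def partial_l2(q: Sequence[float], x: Sequence[float]) -> float:
    """Partial squared L2 distance over one rank's slice."""
    return sum((a - b) ** 2 for a, b in zip(q, x))


def partial_ip(q: Sequence[float], x: Sequence[float]) -> float:
    """Partial inner product over one rank's slice."""
    return sum(a * b for a, b in zip(q, x))


def rank_parallel_distance(
    query: Sequence[float],
    vector: Sequence[float],
    num_ranks: int,
    metric: str = "l2",
) -> float:
    """Combine per-rank partial results into one distance value.

    The list comprehension runs sequentially here; in hardware each slice
    would be handled by its own rank concurrently.
    """
    if metric not in ("l2", "ip"):
        raise ValueError("metric must be 'l2' or 'ip'")
    kernel = partial_l2 if metric == "l2" else partial_ip
    q_slices = split_across_ranks(query, num_ranks)
    v_slices = split_across_ranks(vector, num_ranks)
    partials = [kernel(q, v) for q, v in zip(q_slices, v_slices)]
    return sum(partials)


print(rank_parallel_distance([1, 2, 3, 4], [0, 0, 0, 0], num_ranks=2))        # squared L2
print(rank_parallel_distance([1, 2, 3, 4], [1, 1, 1, 1], num_ranks=2, metric="ip"))
```

Both metrics are sums over dimensions, so the per-rank partials can be computed independently and reduced with a single addition step, which is what makes this kind of split possible.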