Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
Seoyoung Ko, Hyunjeong Shim, Wanju Doh, Sungmin Yun, Jinin So, Yongsuk Kwon, Sang-Soo Park, Si-Dong Roh, Minyong Yoon, Taeksang Song, Jung Ho Ahn
TL;DR
The paper addresses the bottleneck of billion-scale approximate nearest neighbor search (ANNS) for retrieval-augmented generation (RAG), where traditional DRAM/SSD solutions struggle with capacity and latency while RDMA clusters incur network overhead. It proposes Cosmos, a full in-memory ANNS system that offloads search to compute-capable CXL devices, eliminating host intervention and PCIe traffic. The approach combines rank-level parallel distance computation, which exploits DRAM rank parallelism, with adjacency-aware data placement, which balances search loads across devices. Evaluations on SIFT1B and DEEP1B show up to 6.72× higher throughput than a baseline CXL system and 2.35× higher than a state-of-the-art CXL-based solution, demonstrating scalable, low-latency ANNS suitable for billion-scale RAG pipelines.
Abstract
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting relevant context extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity and latency limits, whereas specialized hardware lacks flexibility and RDMA clusters incur network overhead. We present Cosmos, which integrates general-purpose cores within CXL memory devices for full ANNS offload and introduces rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× higher than a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
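The two techniques named in the abstract can be illustrated with a minimal software sketch. It is not the authors' implementation: the abstract does not give the algorithms, so the code assumes (1) that rank-level parallelism means each rank computes a partial squared L2 distance over a slice of the vector dimensions, with the partial results summed afterwards, and (2) that adjacency-aware placement means nearby clusters are spread across different devices so that one query's probes do not pile onto a single device. All function names and parameters (`rank_parallel_sq_l2`, `adjacency_aware_placement`, `num_ranks`, `num_neighbors`) are hypothetical.

```python
from typing import List, Sequence


def partial_sq_l2(query_slice: Sequence[float], vector_slice: Sequence[float]) -> float:
    """Squared L2 distance over one dimension slice (the work of one assumed rank)."""
    return sum((q - v) ** 2 for q, v in zip(query_slice, vector_slice))


def rank_parallel_sq_l2(query: Sequence[float], vector: Sequence[float], num_ranks: int) -> float:
    """Split the dimensions into contiguous slices, one per rank, and sum the partials.

    The slices are processed sequentially here; in hardware they would be
    independent and could run concurrently (an assumption of this sketch).
    """
    dim = len(query)
    step = max(1, -(-dim // num_ranks))  # ceiling division, at least 1
    partials = [
        partial_sq_l2(query[i:i + step], vector[i:i + step])
        for i in range(0, dim, step)
    ]
    return sum(partials)


def adjacency_aware_placement(
    centroids: Sequence[Sequence[float]], num_devices: int, num_neighbors: int
) -> List[int]:
    """Greedy placement (assumed policy): put each cluster on the device that
    holds the fewest of its nearest clusters, breaking ties by lowest load."""
    n = len(centroids)
    # Nearest clusters of each cluster, by squared centroid distance.
    neighbors = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: rank_parallel_sq_l2(centroids[i], centroids[j], 1),
        )
        neighbors.append(others[:num_neighbors])

    placement = [-1] * n          # device index per cluster, -1 = unplaced
    load = [0] * num_devices      # number of clusters per device
    for i in range(n):
        conflicts = [0] * num_devices
        for j in neighbors[i]:
            if placement[j] != -1:
                conflicts[placement[j]] += 1
        device = min(range(num_devices), key=lambda d: (conflicts[d], load[d], d))
        placement[i] = device
        load[device] += 1
    return placement


if __name__ == "__main__":
    print(rank_parallel_sq_l2([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], num_ranks=2))
    print(adjacency_aware_placement([[0.0], [1.0], [10.0], [11.0]], num_devices=2, num_neighbors=1))
```

In this toy setup, the two clusters near 0 land on different devices, as do the two near 10, so a query that probes a pair of adjacent clusters touches both devices rather than one.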
