Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

Wenqi Jiang; Marco Zeller; Roger Waleffe; Torsten Hoefler; Gustavo Alonso

TL;DR

Chameleon tackles the inefficiency of serving retrieval-augmented language models (RALMs) with a heterogeneous, disaggregated accelerator system in which LLM inference and vector search scale independently. ChamVS provides near-memory, FPGA-based PQ decoding and a $K$-selection engine built on an approximate hierarchical priority queue (AHPQ), while ChamLM runs multi-GPU inference; a CPU coordinator orchestrates end-to-end retrieval. The approach achieves up to $2.16\times$ lower latency and $3.18\times$ higher throughput for end-to-end inference, and up to $23.72\times$ lower latency for vector search. It also shows that the optimal ratio of accelerators varies across RALM configurations, which justifies disaggregation. This work signals a practical path to high-performance RALMs in environments with very large vector stores and diverse retrieval schedules.
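To make the two vector-search steps named above concrete, here is a minimal software sketch of PQ decoding (distance computation over quantized codes via a per-query lookup table) followed by $K$-selection. It is an illustration under stated assumptions, not the ChamVS hardware design: the function names, array shapes, and the use of `np.argpartition` as a stand-in for the AHPQ are all hypothetical choices made for this example.

```python
import numpy as np

def build_lookup_table(query, codebooks):
    """Squared L2 distance from each query sub-vector to every centroid.

    codebooks: (m, ksub, dsub) array, one codebook per sub-space (assumed layout).
    query:     (m * dsub,) array.
    Returns a (m, ksub) table.
    """
    m, ksub, dsub = codebooks.shape
    sub_queries = query.reshape(m, 1, dsub)
    return ((codebooks - sub_queries) ** 2).sum(axis=2)

def pq_decode_distances(codes, table):
    """Approximate distances for PQ codes by summing table entries.

    codes: (n, m) integer array; codes[i, j] is the centroid id of
           database vector i in sub-space j.
    """
    m = table.shape[0]
    # table[j, codes[i, j]] gathered for every (i, j), then summed over j.
    return table[np.arange(m), codes].sum(axis=1)

def select_top_k(distances, k):
    """Indices of the k smallest distances, sorted ascending.

    Exact software stand-in for K-selection; the approximate
    hierarchical priority queue is NOT reproduced here.
    """
    idx = np.argpartition(distances, k - 1)[:k]
    return idx[np.argsort(distances[idx])]

# Toy usage with random data (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(4, 16, 2))      # m=4 sub-spaces, 16 centroids each
codes = rng.integers(0, 16, size=(100, 4))   # 100 encoded database vectors
query = rng.normal(size=8)

table = build_lookup_table(query, codebooks)
nearest = select_top_k(pq_decode_distances(codes, table), k=5)
```

The lookup table is built once per query, so each database vector costs only `m` table reads and additions. That memory-bound access pattern is a plausible reason for placing this step near memory.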

Abstract

A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge during text generation. This strategy facilitates impressive generation quality even with smaller models, thus reducing computational demands by orders of magnitude. To serve RALMs efficiently and flexibly, we propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both inference and retrieval, while the disaggregation allows independent scaling of LLM and vector search accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements vector search accelerators on FPGAs and assigns LLM inference to GPUs, with CPUs as cluster coordinators. Evaluated on various RALMs, Chameleon exhibits up to $2.16\times$ reduction in latency and $3.18\times$ speedup in throughput compared to the hybrid CPU-GPU architecture. The promising results pave the way for adopting heterogeneous accelerators for not only LLM inference but also vector search in future RALM systems.

Paper Structure (19 sections, 13 figures, 4 tables)

This paper contains 19 sections, 13 figures, 4 tables.

Introduction
Background and Motivation
Retrieval-Augmented Language Models
Large-Scale Vector Search
Motivation: Efficient RALM Inference
Chameleon: System Overview
ChamVS Near-Memory Accelerator
PQ Decoding Units
Efficient $K$-Selection Module
Primitive: Systolic Priority Queue
Approximate Hierarchical Priority Queue (AHPQ)
Memory Management and Load Balancing
Implementation
Evaluation
Experimental Setup
...and 4 more sections

Figures (13)

Figure 1: A retrieval-augmented language model (RALM).
Figure 2: Product quantization (PQ) for vector search.
Figure 3: Chameleon is a heterogeneous and disaggregated accelerator system for efficient RALM inference.
Figure 4: The ChamVS near-memory retrieval accelerator.
Figure 5: The architecture design of a PQ decoding unit.
...and 8 more figures
