CQ-CiM: Hardware-Aware Embedding Shaping for Robust CiM-Based Retrieval

Xinzhao Li, Alptekin Vardar, Franz Müller, Navya Goli, Umamaheswara Tida, Kai Ni, X. Sharon Hu, Thomas Kämpfe, Ruiyang Qin

TL;DR

CQ-CiM is a unified, hardware-aware data shaping framework that jointly learns Compression and Quantization to produce CiM-compatible low-bit embeddings for diverse CiM designs. To the authors' knowledge, it is the first work to shape data for comprehensive CiM usage on RAG.

Abstract

Deploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand, but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structure. However, CiM's adoption for RAG is limited by a fundamental "representation gap," as high-precision, high-dimension embeddings are incompatible with CiM's low-precision, low-dimension array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique designs (e.g., 2-bit cells, 512x512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation. Current data shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of CiM for RAG but also confuses hardware designers, making it unclear if a failure is due to the circuit design or the degraded input data. As a result, CiM adoption remains limited. In this paper, we introduce CQ-CiM, a unified, hardware-aware data shaping framework that jointly learns Compression and Quantization to produce CiM-compatible low-bit embeddings for diverse CiM designs. To the best of our knowledge, this is the first work to shape data for comprehensive CiM usage on RAG.
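To make the representation gap concrete, here is a minimal sketch of the naive, disjoint shaping the abstract criticizes: a fixed random projection reduces the dimension, then a separate uniform quantizer maps each value to a 2-bit level. It is not CQ-CiM itself. The function names, the toy sizes (16 to 4 dimensions), the per-vector min/max scaling, and the unsigned integer dot product standing in for a crossbar's multiply-accumulate are all assumptions chosen for illustration.

```python
import random

def project(vec, matrix):
    """Reduce dimension with a fixed projection matrix (one row per output dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def quantize_uniform(vec, bits=2):
    """Map each value to an integer level in [0, 2**bits - 1] by uniform binning.

    Assumption: the range is taken from the vector's own min and max.
    """
    levels = 2 ** bits
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0] * len(vec)
    step = (hi - lo) / levels
    # Clamp so the maximum value lands in the top level rather than overflowing.
    return [min(int((x - lo) / step), levels - 1) for x in vec]

def crossbar_scores(query_levels, stored_levels):
    """Integer dot product against each stored row (idealized, noise-free)."""
    return [sum(q * s for q, s in zip(query_levels, row)) for row in stored_levels]

rng = random.Random(0)
in_dim, out_dim = 16, 4  # hypothetical sizes, far smaller than real embeddings
proj = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
docs = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(3)]
query = docs[1]  # the query is an exact copy of document 1

# Step 1 (dimension) and step 2 (precision) are applied independently.
doc_levels = [quantize_uniform(project(d, proj)) for d in docs]
query_levels = quantize_uniform(project(query, proj))
scores = crossbar_scores(query_levels, doc_levels)
print(doc_levels, scores)
```

Even in this idealized setting, the exact-match document is not guaranteed to receive the highest score, because an unsigned dot product over a handful of 2-bit levels rewards large stored values regardless of how well they match. This is one way that shaping dimension and precision independently can cost retrieval fidelity.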

Paper Structure (15 sections, 5 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Illustration of CiM-based embedding retrieval on a FeFET crossbar array within a RAG pipeline.
  • Figure 2: Experimental characterization and functionality of the FeFET-based compute-in-memory array in 28 nm CMOS. (a) Layout of the mixed-signal FeFET crossbar array, (b) fabricated die photograph, (c) measured bitline current accumulation under progressive wordline activation, and (d) programmed 2-bit multi-level FeFET operation of 50 devices exhibiting four distinct $V_\mathrm{T}$ states ($L_0$–$L_3$).
  • Figure 3: Overview of CQ-CiM. Left: The framework jointly shapes embedding dimension and precision via a LoRA-based adapter with noise injection for hardware robustness. Right: The embedding model trained with the CQ-CiM framework bridges CiM and RAG.
  • Figure 4: Visualization of the quantized values of a single embedding (dimension = 25) using a fixed nonuniform threshold (left) and a learned N2UQ threshold (right).
  • Figure 5: Ablation study on ArguAna. Retrieval performance across LoRA settings, compression methods, and quantization strategies. LoRA (8,16) + Dense + N2UQ achieves the strongest retrieval accuracy.
  • ...and 2 more figures