CAMformer: Associative Memory is All You Need

Tergel Molom-Ochir, Benjamin F. Morris, Mark Horton, Chiyue Wei, Cong Guo, Brady Taylor, Peter Liu, Shan X. Wang, Deliang Fan, Hai Helen Li, Yiran Chen

TL;DR

CAMformer reimagines Transformer attention as associative memory, using a voltage-domain BA-CAM to perform in-memory similarity search with constant latency. It comprises three pipelined stages (association, normalization, and contextualization) that together compute $\hat{A}=\mathrm{SoftMax}(\mathrm{Top}\text{-}32(QK^\top))$ and then apply $A=\hat{A}V$, while hierarchical two-stage top-$k$ filtering hides DRAM latency. Key contributions include the BA-CAM circuit design, the BIMV engine, fully binarized attention scores, and multi-stage pipelining with hierarchical ranking. Together these achieve over 10× higher energy efficiency, up to 4× higher throughput, and 6–8× lower area than state-of-the-art accelerators, with near-lossless accuracy. The approach targets scalable, energy-efficient attention for large models and long-context tasks, which matters when deploying attention-heavy Transformer workloads in resource-constrained environments.

Abstract

Transformers face scalability challenges due to the quadratic cost of attention, which involves dense similarity computations between queries and keys. We propose CAMformer, a novel accelerator that reinterprets attention as an associative memory operation and computes attention scores using a voltage-domain Binary Attention Content Addressable Memory (BA-CAM). This enables constant-time similarity search through analog charge sharing, replacing digital arithmetic with physical similarity sensing. CAMformer integrates hierarchical two-stage top-k filtering, pipelined execution, and high-precision contextualization to achieve both algorithmic accuracy and architectural efficiency. Evaluated on BERT and Vision Transformer workloads, CAMformer achieves over 10× higher energy efficiency, up to 4× higher throughput, and 6–8× lower area compared to state-of-the-art accelerators, while maintaining near-lossless accuracy.
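
The three stages named above (association, normalization, contextualization) can be sketched in software for a single query. This is only a functional illustration under stated assumptions, not the authors' hardware or algorithm: sign binarization, the `1/sqrt(d)` scaling, the tile size, the per-tile candidate count `k_tile`, and all function names are hypothetical choices made for this example. Only the final top-32 selection followed by SoftMax and a full-precision product with $V$ follows the formula in the TL;DR.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1} by sign (zeros map to +1). Assumed scheme."""
    return np.where(x >= 0, 1, -1).astype(np.int32)

def two_stage_topk(scores, tile, k_tile, k_final):
    """Hierarchical top-k: keep k_tile candidates per tile, then k_final overall."""
    candidates = []
    for start in range(0, scores.shape[0], tile):
        block = scores[start:start + tile]
        kt = min(k_tile, block.shape[0])
        # Stage 1: indices of the kt largest scores inside this tile.
        local = np.argpartition(-block, kt - 1)[:kt] + start
        candidates.append(local)
    candidates = np.concatenate(candidates)
    kf = min(k_final, candidates.shape[0])
    # Stage 2: global ranking over the surviving candidates only.
    order = np.argpartition(-scores[candidates], kf - 1)[:kf]
    return candidates[order]

def sparse_binary_attention(q, K, V, k=32, tile=16, k_tile=8):
    """Single-query attention with binarized scores and top-k sparsification."""
    # Association: binary similarity, an integer in [-d, d].
    scores = binarize(K) @ binarize(q)
    keep = two_stage_topk(scores, tile, k_tile, k)
    # Normalization: numerically stable softmax over the kept scores only.
    s = scores[keep].astype(np.float64) / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    # Contextualization: weighted sum of full-precision value vectors.
    return w @ V[keep], keep, w
```

When `k_tile` is smaller than `k`, the two-stage ranking is an approximation of exact top-$k$, since a tile holding more than `k_tile` of the globally best keys loses some of them. Setting `k_tile >= k` recovers the exact top-$k$ score set at the cost of a larger second stage.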

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Attention as a key-lock mechanism. The query vector (Q) is used to determine similarity with stored keys (K), producing an attention score. The resulting soft selection determines which value vectors (V) to aggregate. This metaphor illustrates attention as an associative memory operation, where queries “unlock” relevant stored information.
  • Figure 2: Array-level architecture of an example 2×6 BA-CAM array used for binary attention computation. Each row in the array performs parallel similarity matching against the broadcast query, with charge sharing accumulating match strength on shared matchlines. The inset shows the 10T1C CAM cell structure and an illustrative example of binary attention scoring based on Hamming similarity.
  • Figure 3: (a) Matchline voltage traces for varying partial matches in a 1×10 BA-CAM array. (b) PVT analysis across corners for a 16×64 array.
  • Figure 4: Illustration of matrix-vector multiplication. Left: conventional (top) versus CAM-based (bottom) computation. Right: tiling steps for larger matrix-vector operations.
  • Figure 5: Per-op energy vs. matrix dimension $M$ in BA-CAM. Larger $M$ reduces energy by amortizing programming cost. Dashed lines show search-only and total energy bounds.
  • ...and 5 more figures
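
The Hamming-similarity scoring in Figure 2 and the tiling idea in Figure 4 can be mimicked digitally. The sketch below is a software analogy only. It counts matching bits per stored row (XNOR followed by a population count) and accumulates partial counts over fixed-size tiles. The function names are hypothetical, and the 16×64 tile shape is borrowed from the array size mentioned in Figure 3 purely for illustration. The sketch does not model the analog charge sharing on the matchlines.

```python
import numpy as np

def hamming_similarity(query_bits, key_bits):
    """Number of matching bit positions between the query and each stored row."""
    return np.sum(query_bits[None, :] == key_bits, axis=1)

def tiled_cam_matvec(key_bits, query_bits, rows=16, cols=64):
    """Accumulate per-tile match counts to cover a matrix larger than one array."""
    n, d = key_bits.shape
    sims = np.zeros(n, dtype=np.int64)
    for r in range(0, n, rows):          # row tiles: different stored keys
        for c in range(0, d, cols):      # column tiles: slices of the query
            sims[r:r + rows] += hamming_similarity(
                query_bits[c:c + cols], key_bits[r:r + rows, c:c + cols]
            )
    return sims

# A 2x6 example in the spirit of Figure 2: one full match, one full mismatch.
query = np.array([1, 0, 1, 1, 0, 1])
keys = np.array([[1, 0, 1, 1, 0, 1],
                 [0, 1, 0, 0, 1, 0]])
print(hamming_similarity(query, keys))   # [6 0]
```

For bits mapped to $\{-1,+1\}$, the dot product equals $2m - d$, where $m$ is the match count and $d$ the vector length, so ranking by Hamming similarity gives the same order as ranking by the binary dot product.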