CAMformer: Associative Memory is All You Need
Tergel Molom-Ochir, Benjamin F. Morris, Mark Horton, Chiyue Wei, Cong Guo, Brady Taylor, Peter Liu, Shan X. Wang, Deliang Fan, Hai Helen Li, Yiran Chen
TL;DR
CAMformer reimagines Transformer attention as associative memory, using a voltage-domain Binary Attention Content-Addressable Memory (BA-CAM) to perform in-memory similarity search with constant latency. It comprises three pipelined stages (association, normalization, and contextualization) that together compute $\hat{A}=\mathrm{SoftMax}(\mathrm{Top\text{-}32}(QK^\top))$ and then apply $A=\hat{A}V$, while hierarchical two-stage top-$k$ filtering hides DRAM latency. Key contributions include the BA-CAM circuit design, the BIMV engine, fully binarized attention scores, and multi-stage pipelining with hierarchical ranking. On BERT and Vision Transformer workloads, CAMformer achieves over 10× higher energy efficiency, up to 4× higher throughput, and 6–8× lower area than state-of-the-art accelerators, with near-lossless accuracy. The approach targets scalable, energy-efficient attention for large models and long-context tasks, which matters when deploying attention-heavy Transformer workloads in resource-constrained environments.
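As a purely software illustration of the three stages named above, the following NumPy sketch computes $\mathrm{SoftMax}(\mathrm{Top\text{-}32}(QK^\top))V$ with binarized queries and keys. It is not the hardware design: the sign-based binarization, the function names `binarize` and `camformer_attention_sketch`, and the absence of any score scaling are assumptions made for the example, and the analog BA-CAM behavior is replaced by an ordinary integer matrix product.

```python
import numpy as np

def binarize(x):
    # Assumption: sign binarization to {-1, +1}; zeros map to +1.
    return np.where(x >= 0, 1, -1).astype(np.int32)

def camformer_attention_sketch(Q, K, V, k=32):
    """Software stand-in for association, normalization, contextualization."""
    # Association: binary similarity scores, integers in [-d, d].
    scores = binarize(Q) @ binarize(K).T            # shape (n_q, n_k)
    k = min(k, scores.shape[1])

    # Keep the k highest-scoring keys for each query (unordered within the k).
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    top = np.take_along_axis(scores, idx, axis=1).astype(np.float64)

    # Normalization: softmax over the surviving k scores only.
    top -= top.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(top)
    weights /= weights.sum(axis=1, keepdims=True)

    # Contextualization: weighted sum of full-precision value rows.
    out = np.einsum("qk,qkd->qd", weights, V[idx])
    return out, idx, weights
```

Calling it with `Q` of shape `(n_q, d)`, `K` of shape `(n_k, d)`, and `V` of shape `(n_k, d_v)` returns an output of shape `(n_q, d_v)`, plus the selected key indices and their softmax weights. Only the values stay in full precision here, mirroring the "high-precision contextualization" described in the abstract.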
Abstract
Transformers face scalability challenges due to the quadratic cost of attention, which involves dense similarity computations between queries and keys. We propose CAMformer, a novel accelerator that reinterprets attention as an associative memory operation and computes attention scores using a voltage-domain Binary Attention Content Addressable Memory (BA-CAM). This enables constant-time similarity search through analog charge sharing, replacing digital arithmetic with physical similarity sensing. CAMformer integrates hierarchical two-stage top-k filtering, pipelined execution, and high-precision contextualization to achieve both algorithmic accuracy and architectural efficiency. Evaluated on BERT and Vision Transformer workloads, CAMformer achieves over 10× energy efficiency, up to 4× higher throughput, and 6–8× lower area compared to state-of-the-art accelerators, while maintaining near-lossless accuracy.
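The ranking logic behind hierarchical two-stage top-k filtering can be sketched as follows, under stated assumptions: the keys are split into fixed-size groups, each group keeps its local top-k, and a second pass selects the final top-k from the pooled candidates. The name `two_stage_topk` and the `group_size` of 64 are hypothetical; the abstract does not give the actual stage sizes or describe how the stages are scheduled against DRAM accesses, so this shows only the selection, not the latency hiding.

```python
import numpy as np

def two_stage_topk(scores, k=32, group_size=64):
    """Return indices of k largest entries of a 1-D score vector in two stages."""
    n = scores.shape[0]
    candidates = []

    # Stage 1: each group of keys keeps its own local top-k.
    for start in range(0, n, group_size):
        block = scores[start:start + group_size]
        kk = min(k, block.shape[0])
        local = np.argpartition(-block, kk - 1)[:kk]
        candidates.append(local + start)            # convert to global indices

    # Stage 2: final top-k among the pooled candidates.
    cand = np.concatenate(candidates)
    kf = min(k, cand.shape[0])
    best = np.argpartition(-scores[cand], kf - 1)[:kf]
    return cand[best]
```

Because every group keeps at least as many entries as the final k, no globally top-k score can be discarded in stage 1, so the selected score values match those of a single global top-k. With binarized scores ties are common, and which of several tied keys is returned may differ between the two approaches.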
