MXFormer: A Microscaling Floating-Point Charge-Trap Transistor Compute-in-Memory Transformer Accelerator

George Karfakis; Samyak Chakrabarty; Vinod Kurian Jacob; Siyun Qiao; Subramanian S. Iyer; Sudhakar Pamarti; Puneet Gupta

TL;DR

Transformer inference for short sequences is limited by weight movement and memory bandwidth. MXFormer combines fully weight-stationary (FWS) compute-in-memory (CIM) with MXFP4 data formats and per-block exponent alignment, mapping static linear layers to analog charge-trap transistor (CTT) arrays and routing dynamic attention through digital systolic blocks. It reaches near-digital accuracy using post-training quantization (PTQ) only. The design delivers up to 58k FPS on ViT-L/32 across two chips, 20.9x higher compute density (TOPS/mm$^2$) than comparable FWS accelerators, and 3.3x–60.5x higher compute density than non-FWS digital, hybrid, and photonic accelerators, with less than 1% accuracy loss and no retraining. This enables dense, on-die storage of hundreds of millions of parameters and efficient, low-latency fixed-model Transformer inference suitable for automotive and other safety-critical contexts.

Abstract

The proliferation of Transformer models is often constrained by the significant computational and memory bandwidth demands of deployment. To address this, we present MXFormer, a novel, hybrid, weight-stationary Compute-in-Memory (CIM) accelerator that provides high throughput and efficiency for fixed-model inference on large short-sequence Transformers. The foundation of our architecture is the use of ultra-dense Charge-Trap Transistors (CTTs) in Microscaling MXFP4 CIM arrays, which uniquely enables on-chip storage of up to hundreds of millions of parameters in Fully Weight Stationary (FWS) fashion. We introduce a statically partitioned design with 12 Transformer blocks connected by a deeply pipelined dataflow. Static-weight layers (MLPs and linear projections) execute on highly parallel analog CTT arrays using an MXFP4-native flow with per-block exponent alignment and a 10-bit SAR ADC. Dynamic computations are handled in fully accurate digital blocks that use MXFP-enabled systolic arrays for scaled dot-product attention and vector units for LayerNorm and FlashAttention-style Softmax. By eliminating all weight movement, the deeply pipelined MXFormer architecture yields very high single-stream throughput and efficiency, processing 58275 FPS on ViT-L/32 (dual-chip) or 41269 FPS on ViT-B/16 (single chip). MXFormer outperforms comparable state-of-the-art non-FWS digital, hybrid, and photonic Transformer accelerators by ~3.3x-60.5x in compute density and ~1.7x-2.5x in energy efficiency. Against FWS accelerators, MXFormer improves compute density by ~20.9x and resident weight storage density by ~2x, while preserving near-digital accuracy (drop of <1%) without any model retraining.
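
To make the MXFP4 format mentioned above concrete, the following is a minimal sketch of block quantization under the general OCP Microscaling convention: up to 32 elements share one power-of-two scale, and each element is stored as FP4 (E2M1). It is not the paper's analog flow or its exponent alignment scheme. The function names, the nearest-value rounding with ties resolved toward the smaller magnitude, and the scale-selection rule are assumptions made for illustration.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32    # elements sharing one scale in the MX formats


def quantize_mxfp4_block(values):
    """Hypothetical helper: return (shared_exponent, fp4_elements) for one block."""
    if len(values) > BLOCK_SIZE:
        raise ValueError("an MX block holds at most 32 elements")
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # floor(log2(max_abs)) computed exactly via frexp (max_abs = m * 2**e, 0.5 <= m < 1).
    floor_log2 = math.frexp(max_abs)[1] - 1
    # Assumed rule: place the largest element in the top E2M1 binade,
    # then clamp to the exponent range of an 8-bit power-of-two scale.
    shared_exp = max(-127, min(127, floor_log2 - E2M1_EMAX))
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        # Simplified rounding: nearest grid point, ties go to the smaller magnitude.
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_mxfp4_block(shared_exp, elements):
    """Hypothetical helper: reconstruct floats from a shared exponent and FP4 elements."""
    return [e * 2.0 ** shared_exp for e in elements]


shared_exp, elements = quantize_mxfp4_block([0.1, -0.75, 3.0, 0.02])
restored = dequantize_mxfp4_block(shared_exp, elements)
```

In this example the largest value, 3.0, sets the shared exponent, so the two small entries fall below the smallest nonzero FP4 step and quantize to zero. This is the kind of within-block dynamic range loss that a shared scale implies.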

Paper Structure (38 sections, 5 equations, 12 figures, 10 tables)

Introduction
Background and Motivation
Transformer
Transformer Architecture
Transformer Workloads and Target Domain
Fully Weight Stationary (FWS) execution
Microscaling Data Formats
Charge Trap Transistor
Device Characteristics
CTT for Analog Compute-in-Memory
Comparison with Other Analog NVM Technologies
MXFormer CTT-CIM Macro Architecture
Macro Topology and Organization
Bias Handling & MXFP Exponent Alignment
Exponent Target Selection Strategies
...and 23 more sections

Figures (12)

Figure 1: Transformer Layer. Blue components represent linear layers with analog NVM-mappable static weights. Red components utilize dynamic weights and require digital compute units. Layers are connected sequentially to form a full Transformer.
Figure 2: Comparison of Static (mappable to NVM CIM) vs Dynamic (requires digital compute) for various short-sequence length models ($y$-axis starts at 70%). The sequence length considered is the maximum supported by each model.
Figure 3: Left-to-right: a) Weight block, b) MXFP Block, and c) Macro architecture of the proposed CTT CIM.
Figure 4: Logical schematic of the MXFP block.
Figure 5: A comparison of online and offline exponent target selection strategies. As shown, "Row Hist 2-Pass" is effectively identical to "Row Hist" at half the CM Correction Bits. The ADC is not modeled.
...and 7 more figures
