Count2Multiply: Reliable In-Memory High-Radix Counting

João Paulo Cardoso de Lima; Benjamin Franklin Morris; Asif Ali Khan; Jeronimo Castrillon; Alex K. Jones

TL;DR

Count2Multiply introduces a high-radix, in-memory counting framework for digital CIM that leverages Johnson counters and XOR-homomorphic ECC to achieve reliable, low-latency multiplication and related operations. The method broadcasts X-derived digit increments as memory command sequences to in-memory counters storing Z masks, enabling efficient integer-vector/matrix and tensor-style computations in DRAM and, with extensions, NVMs. Key contributions include (i) high-radix counter design with optimized increment paths (including k-ary and IARM), (ii) a host-assisted execution model that supports vector/matrix and matrix-matrix products through broadcast-accumulate, and (iii) a fault-tolerance scheme that integrates with conventional ECC to reduce recomputation overhead relative to TMR. Experimental results show up to $10\times$ speedup, $8\times$ GOPS/W, and $9.5\times$ GOPS/area versus state-of-the-art in-DRAM CIM, with strong performance in sparse workloads and competitive results against GPUs for several kernels, underscoring Count2Multiply’s practical impact for energy-efficient CIM accelerators.
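The counting idea in the summary above can be sketched in software. The sketch below is a minimal model and not the paper's implementation. It assumes 5-bit Johnson counters (radix 10), three digits per output column, and plain unit increments rather than the optimized k-ary or IARM paths. The function names (`jc_increment`, `jc_value`, `masked_increment`, `vector_binary_matrix`) are hypothetical. It computes an integer-vector by binary-matrix product by applying `x[i]` increments to the counters of every column whose mask bit `Z[i][j]` is set. Real hardware would update all columns in parallel with bulk-bitwise operations, whereas this model loops over them.

```python
def jc_increment(bits):
    """Advance a Johnson counter one step: shift by one position and
    feed the inverted last bit back into the first position."""
    return [1 - bits[-1]] + bits[:-1]


def jc_value(bits):
    """Decode a Johnson counter state into its count in [0, 2 * len(bits))."""
    ones = sum(bits)
    return ones if bits[-1] == 0 else 2 * len(bits) - ones


def masked_increment(counters, mask):
    """Add 1 to every column counter whose mask bit is set.

    counters[col] is a list of digits, least-significant digit first.
    A digit that wraps back to 0 ripples a carry into the next digit.
    """
    for col, enabled in enumerate(mask):
        if not enabled:
            continue
        digits = counters[col]
        for d in range(len(digits)):
            digits[d] = jc_increment(digits[d])
            if jc_value(digits[d]) != 0:  # no wrap-around, so no carry
                break


def vector_binary_matrix(x, Z, n_bits=5, n_digits=3):
    """Compute y[j] = sum_i x[i] * Z[i][j] by masked counting (assumed model)."""
    n_cols = len(Z[0])
    counters = [[[0] * n_bits for _ in range(n_digits)] for _ in range(n_cols)]
    for x_i, mask_row in zip(x, Z):
        for _ in range(x_i):  # simplification: x_i unit increments
            masked_increment(counters, mask_row)
    radix = 2 * n_bits
    return [
        sum(jc_value(digit) * radix**k for k, digit in enumerate(digits))
        for digits in counters
    ]


if __name__ == "__main__":
    x = [3, 12, 7]
    Z = [[1, 0, 1],
         [1, 1, 0],
         [0, 1, 1]]
    print(vector_binary_matrix(x, Z))
```

With three radix-10 digits, each column in this model can hold counts up to 999 before the top digit wraps.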

Abstract

Computing-in-memory (CIM) has been demonstrated across various memory technologies, ranging from memristive crossbars performing analog dot-product computations to large-scale digital bitwise operations in commodity DRAM and other proposed non-volatile memory technologies. However, current CIM solutions face latency and reliability challenges. CIM fidelity lags considerably behind access fidelity. Furthermore, bulk-bitwise CIM, although highly parallelized, incurs long latency for operations like multiplication and addition because it computes bit-serially. This paper presents Count2Multiply, a technology-agnostic digital CIM approach that performs multiplication, addition, and other operations using high-radix, massively parallel counting enabled by CIM bulk-bitwise logic operations. Designed to meet fault-tolerance requirements, Count2Multiply integrates traditional row-wise error correction codes, such as Hamming and BCH, to address the high error rates in existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. However, we note that the Count2Multiply architecture is compatible with other functionally complete CIM proposals. Compared to the state-of-the-art in-DRAM CIM method, Count2Multiply achieves up to 10x speedup, 8x higher GOPS/Watt, and 9.5x higher GOPS/area, while outperforming a GPU for vector-matrix multiplications.
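Hamming and BCH codes are linear over GF(2), so their encoding commutes with bitwise XOR: encoding `a XOR b` gives the same codeword as XOR-ing the codewords of `a` and `b`. The sketch below checks this property on a textbook Hamming(7,4) code. It only illustrates that property of linear codes and is not the paper's protection scheme; the function name `hamming74_encode` and the codeword layout are assumptions chosen for the example.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value with a textbook Hamming(7,4) code.

    Codeword layout, most-significant bit first: p1 p2 d1 p3 d2 d3 d4.
    """
    d1 = (nibble >> 3) & 1
    d2 = (nibble >> 2) & 1
    d3 = (nibble >> 1) & 1
    d4 = nibble & 1
    # Each parity bit is an XOR of data bits, which makes the code linear.
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = 0
    for bit in (p1, p2, d1, p3, d2, d3, d4):
        word = (word << 1) | bit
    return word


def is_xor_homomorphic(encode, n_values=16):
    """Check encode(a ^ b) == encode(a) ^ encode(b) for all input pairs."""
    return all(
        encode(a ^ b) == encode(a) ^ encode(b)
        for a in range(n_values)
        for b in range(n_values)
    )


if __name__ == "__main__":
    print(is_xor_homomorphic(hamming74_encode))
```

Because the property holds for every input pair, check bits for the XOR of two protected rows can be derived from the rows' existing check bits without re-encoding the data.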

Paper Structure (38 sections, 1 equation, 19 figures, 3 tables, 2 algorithms)

Introduction
Background and Related Work
DRAM Organization and Operation
Compute-In-DRAM
Fault Modes and Fault Tolerance for CIM
Johnson Counters
Motivation
In-Memory High-Radix Counters
Single-Digit Unit Increment
Single-Digit Masked Unit Increment
Overflow Detection in Single-Digit Counters
Multi-Digit Increment
Optimized Counter Design
Variable-Step (k-ary) Increment
Input-Aware Rippling Minimization
...and 23 more sections

Figures (19)

Figure 1: Count2Multiply overview: (a) integer-vector, binary-matrix multiplication example; (b) DRAM subarray with the mapping of counters and masks, and Ambit's row groups, i.e., computing (B-group), control (C-group), and data (D-group).
Figure 2: DRAM organization
Figure 3: Input distribution in DNA pre-alignment filtering and the BERT language model. Values are small (circa 4 to 8 bits).
Figure 4: Fault rate impact on application accuracy.
Figure 5: $C$, 5-bit JCs in memory: (a) before counting; (b) all counters count; (c) masked counting; (d) multi-digit counting.
...and 14 more figures
