Count2Multiply: Reliable In-Memory High-Radix Counting

João Paulo Cardoso de Lima, Benjamin Franklin Morris, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones

TL;DR

Count2Multiply introduces a high-radix, in-memory counting framework for digital CIM that leverages Johnson counters and XOR-homomorphic ECC to achieve reliable, low-latency multiplication and related operations. The method broadcasts X-derived digit increments as memory command sequences to in-memory counters storing Z masks, enabling efficient integer-vector/matrix and tensor-style computations in DRAM and, with extensions, NVMs. Key contributions include (i) high-radix counter design with optimized increment paths (including k-ary and IARM), (ii) a host-assisted execution model that supports vector/matrix and matrix-matrix products through broadcast-accumulate, and (iii) a fault-tolerance scheme that integrates with conventional ECC to reduce recomputation overhead relative to TMR. Experimental results show up to $10\times$ speedup, $8\times$ GOPS/W, and $9.5\times$ GOPS/area versus state-of-the-art in-DRAM CIM, with strong performance in sparse workloads and competitive results against GPUs for several kernels, underscoring Count2Multiply’s practical impact for energy-efficient CIM accelerators.
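The broadcast-accumulate idea can be illustrated with a minimal software sketch, assuming single-digit 5-bit Johnson counters (radix 10), one counter per memory column, and unit increments only. The function names (`johnson_increment`, `decode`, `count_vector_matrix`) and the bitset layout are hypothetical and chosen for illustration; the paper's k-ary and IARM increment paths, multi-digit carry handling, and DRAM command sequences are not modeled here.

```python
def johnson_increment(rows, mask, width):
    """Apply one masked increment to column-wise Johnson counters.

    rows[i] is an int used as a bitset: bit j of rows[i] holds bit i of
    the counter in column j. Only columns whose mask bit is 1 advance.
    """
    full = (1 << width) - 1
    n = len(rows)
    # Johnson step: shift by one position and feed back the inverted last bit.
    shifted = [(~rows[n - 1]) & full] + rows[: n - 1]
    # Masked columns take the shifted value, the others keep their state.
    return [(s & mask) | (r & ~mask & full) for s, r in zip(shifted, rows)]


def decode(rows, col):
    """Convert the Johnson state of one column to an integer in [0, 2n)."""
    n = len(rows)
    bits = [(r >> col) & 1 for r in rows]
    ones = sum(bits)
    # States 0..n fill with ones from bit 0; states n+1..2n-1 drain them.
    return ones if bits[0] == 1 or ones == 0 else 2 * n - ones


def count_vector_matrix(x, Z, n_bits=5):
    """Compute y = x * Z (integer vector, binary matrix) by masked counting.

    Assumes every output stays below 2 * n_bits, so one digit suffices.
    """
    width = len(Z[0])
    rows = [0] * n_bits  # all counters start at zero
    for xk, z_row in zip(x, Z):
        # Row k of Z acts as the mask selecting which counters count.
        mask = sum(bit << j for j, bit in enumerate(z_row))
        for _ in range(xk):  # broadcast x[k] unit increments
            rows = johnson_increment(rows, mask, width)
    return [decode(rows, j) for j in range(width)]


x = [3, 2, 1]
Z = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 0]]
print(count_vector_matrix(x, Z))  # [4, 3, 5, 3]
```

Each call to `johnson_increment` uses only bitwise operations on whole rows, which is the kind of operation bulk-bitwise CIM provides; the loop over `range(xk)` is the simplest possible increment path and is where a higher-radix or k-ary scheme would reduce the number of steps.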

Abstract

Computing-in-memory (CIM) has been demonstrated across various memory technologies, ranging from memristive crossbars performing analog dot-product computations to large-scale digital bitwise operations in commodity DRAM and other proposed non-volatile memory technologies. However, current CIM solutions face latency and reliability challenges. CIM fidelity lags considerably behind access fidelity. Furthermore, bulk-bitwise CIM, although highly parallelized, requires long latency for operations like multiplication and addition, due to their bit-serial computation. This paper presents Count2Multiply, a technology-agnostic digital CIM approach to perform multiplication, addition and other operations using high-radix, massively parallel counting enabled by CIM bulk-bitwise logic operations. Designed to meet fault tolerance requirements, Count2Multiply integrates traditional row-wise error correction codes, such as Hamming and BCH, to address the high error rates in existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. However, we note that the Count2Multiply architecture is compatible with other functionally complete CIM proposals. Compared to the state-of-the-art in-DRAM CIM method, Count2Multiply achieves up to 10x speedup, 8x higher GOPS/Watt, and 9.5x higher GOPS/area, while outperforming GPU for vector-matrix multiplications.
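Linear codes such as Hamming and BCH are XOR-homomorphic: the parity of the XOR of two data words equals the XOR of their parities. The sketch below checks this exhaustively for a Hamming(7,4) code. The parity equations and the helper names (`hamming74_parity`, `is_xor_homomorphic`) are illustrative assumptions; the code parameters Count2Multiply uses and the way it applies this property to counter rows are not modeled.

```python
def hamming74_parity(d):
    """Return the 3 parity bits of a 4-bit data word d (0..15) as an int."""
    d0, d1, d2, d3 = ((d >> i) & 1 for i in range(4))
    p0 = d0 ^ d1 ^ d3
    p1 = d0 ^ d2 ^ d3
    p2 = d1 ^ d2 ^ d3
    return p0 | (p1 << 1) | (p2 << 2)


def is_xor_homomorphic():
    """Check parity(a ^ b) == parity(a) ^ parity(b) for all 4-bit pairs."""
    return all(
        hamming74_parity(a ^ b) == hamming74_parity(a) ^ hamming74_parity(b)
        for a in range(16)
        for b in range(16)
    )


print(is_xor_homomorphic())  # True
```

Because every parity bit is an XOR of data bits, an in-memory XOR applied to both data and parity rows keeps the stored codeword valid, so a result can be checked with the existing row-wise ECC rather than recomputed several times as in TMR.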

Paper Structure (38 sections, 1 equation, 19 figures, 3 tables, 2 algorithms)

This paper contains 38 sections, 1 equation, 19 figures, 3 tables, 2 algorithms.

Figures (19)

  • Figure 1: Count2Multiply overview: (a) integer-vector, binary-matrix multiplication example; (b) DRAM subarray with the counter and mask mapping and Ambit's row groups, i.e., computing (B-group), control (C-group), and data (D-group).
  • Figure 2: DRAM organization
  • Figure 3: Input distribution in DNA pre-alignment filtering and the BERT language model. Values are small (roughly 4 to 8 bits).
  • Figure 4: Fault rate impact on application accuracy.
  • Figure 5: $C$, 5-bit JCs in memory: (a) before counting; (b) all counters count; (c) masked counting; (d) multi-digit counting.
  • ...and 14 more figures