Technology-Circuit-Algorithm Tri-Design for Processing-in-Pixel-in-Memory (P2M)

Md Abdullah-Al Kaiser; Gourav Datta; Sreetama Sarkar; Souvik Kundu; Zihan Yin; Manas Garg; Ajey P. Jacob; Peter A. Beerel; Akhilesh R. Jaiswal

TL;DR

This paper addresses the data-deluge problem in vision systems by advocating processing-in-pixel-in-memory (P2M), i.e., processing inside the pixel array, together with a technology-circuit-algorithm tri-design that combines 3D integration, analog in-pixel computation, and hardware-aware training. It introduces a CMOS+RRAM hybrid design in which weights are stored as RRAM resistance states and applied through in-pixel convolutions, with batch normalization (BN) and ReLU fused into a single-slope ADC (SS-ADC), enabling multi-channel, multi-bit CNN operations at the sensor. The authors provide a trade-off framework spanning area, bandwidth, latency, energy, and accuracy, showing how design choices and 3D integration constraints shape performance and suggesting that reconfigurable weights stored in non-volatile memory (NVM) improve adaptability. They also outline three future directions (non-linearity-aware modeling, frame skipping, and distributed computing with sensor fusion) intended to further reduce data movement and energy while preserving accuracy in real-world tasks such as autonomous driving and surveillance. Overall, the work argues that on-device, end-to-end co-design is essential to realize significant improvements in power, bandwidth reduction, and latency for P2M systems, especially when handling complex visual tasks.

Abstract

The massive amounts of data generated by camera sensors motivate data processing inside pixel arrays, i.e., at the extreme-edge. Several critical developments have fueled recent interest in the processing-in-pixel-in-memory paradigm for a wide range of visual machine intelligence tasks, including (1) advances in 3D integration technology to enable complex processing inside each pixel in a 3D integrated manner while maintaining pixel density, (2) analog processing circuit techniques for massively parallel low-energy in-pixel computations, and (3) algorithmic techniques to mitigate non-idealities associated with analog processing through hardware-aware training schemes. This article presents a comprehensive technology-circuit-algorithm landscape that connects technology capabilities, circuit design strategies, and algorithmic optimizations to power, performance, area, bandwidth reduction, and application-level accuracy metrics. We present our results using a comprehensive co-design framework incorporating hardware and algorithmic optimizations for various complex real-life visual intelligence tasks mapped onto our P2M paradigm.

Paper Structure (15 sections, 4 equations, 7 figures)

Introduction
Processing-in-Pixel Pre-requisites
Prior Work
Proposed RRAM-based in-Pixel Processing
Technology-Circuit-Algorithm Trade-off Analysis
Area Trade-off Analysis
Bandwidth Trade-off Analysis
Latency Trade-off Analysis
Energy Trade-off Analysis
Accuracy Trade-off Analysis
Future Directions
Improved Non-Linearity Modeling for P$^2$M
Frame Skipping
Distributed Computing and Sensor Fusion
Discussions and Conclusions

Figures (7)

Figure 1: Overall P$^2$M-enabled CIS system. (a) Back-side illuminated CMOS image sensor (BI-CIS) die, (b) weight-containing die, (c) pixel circuit, (d) multi-bit multi-channel positive and negative weight banks (mapped into transistor's width (CMOS), or the resistance state (NVM)), (e) SS-ADC performing the ReLU and part of BN operations, (f) IO configurations, (g) different integration technologies for bonding interface, (h) algorithm-hardware co-design framework.
Figure 2: RRAM-based circuit techniques and simulated output for P$^2$M-enabled CIS. (a) Weight-embedded pixel circuit, and (b) a scatter plot comparing the simulated convolutional results (normalized $V_{OUT}$) with ideal convolutional results (normalized weight$\times$input, W$\times$I) using GF 22nm FD-SOI process node for a kernel size of 3$\times$3$\times$3.
Figure 3: Area trade-off analysis of P$^2$M-enabled CIS. (a) An example layout floor-plan for the weight transistors and 3D integrated bonds, (b) normalized area versus output channels for different process nodes and stride numbers considering Cu-Cu hybrid bonding interface, and (c) minimum pixel pitch versus output channels for different nodes and integration technologies.
Figure 4: Bandwidth reduction (BR) trade-off analysis of P$^2$M-enabled CIS. BR versus stride number for different numbers of output channels and pooling stride.
Figure 5: Latency trade-off analysis of P$^2$M-enabled CIS. Maximum frame rate versus stride number for different numbers of output channels and pixel binning.
...and 2 more figures
