Technology-Circuit-Algorithm Tri-Design for Processing-in-Pixel-in-Memory (P2M)
Md Abdullah-Al Kaiser, Gourav Datta, Sreetama Sarkar, Souvik Kundu, Zihan Yin, Manas Garg, Ajey P. Jacob, Peter A. Beerel, Akhilesh R. Jaiswal
TL;DR
This paper addresses the data-deluge problem in vision systems by advocating processing inside pixel arrays (P2M) and a technology-circuit-algorithm tri-design that combines 3D integration, analog in-pixel computation, and hardware-aware training. It introduces a CMOS+RRAM hybrid in which weights are stored as resistance states and convolutions are computed inside the pixel array, with batch normalization (BN) and ReLU fused through a single-slope ADC, enabling multi-channel, multi-bit CNN operations at the sensor. The authors provide a comprehensive trade-off framework across area, bandwidth, latency, energy, and accuracy, showing how design choices and 3D integration constraints shape performance and suggesting that reconfigurable weights stored in non-volatile memory (NVM) improve adaptability. They also outline future directions (non-linearity-aware modeling, frame skipping, and distributed computing with sensor fusion) to further reduce data movement and energy while preserving accuracy in real-world tasks such as autonomous driving and surveillance. Overall, the work argues that on-device, end-to-end co-design is essential to realize significant improvements in power, bandwidth, and latency for P2M systems, especially when handling complex visual tasks.
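To make the fused BN and ReLU step concrete, here is a minimal numerical sketch in Python. It is an idealized illustration, not the authors' circuit: it assumes the BN scale is folded into the stored weights, the BN shift is applied as an offset at the ADC, and ReLU is obtained by clamping the output code at zero. The function names (`fold_batchnorm`, `in_pixel_mac`, `ss_adc_relu`) and the `full_scale` and `bits` parameters are hypothetical.

```python
import math


def fold_batchnorm(weights, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN scale into the kernel weights (assumed mapping).

    Returns (scaled_weights, shift) such that
    sum(p * w_scaled) + shift == gamma * (sum(p * w) - mean) / sqrt(var + eps) + beta.
    """
    scale = gamma / math.sqrt(var + eps)
    scaled = [w * scale for w in weights]
    shift = beta - mean * scale
    return scaled, shift


def in_pixel_mac(pixels, weights):
    """Idealized multiply-accumulate over one receptive field."""
    return sum(p * w for p, w in zip(pixels, weights))


def ss_adc_relu(value, shift, full_scale=1.0, bits=8):
    """Idealized single-slope ADC with an offset and clamping.

    The shift stands in for a counter preset; clamping the code to
    [0, 2**bits - 1] acts as ReLU plus saturation. Both are assumptions.
    """
    levels = (1 << bits) - 1
    code = round((value + shift) / full_scale * levels)
    return max(0, min(levels, code))


if __name__ == "__main__":
    pixels = [0.5, 0.2, 0.1]
    weights = [1.0, -1.0, 0.5]
    w_folded, shift = fold_batchnorm(weights, gamma=0.8, beta=0.1, mean=0.05, var=0.25)
    print(ss_adc_relu(in_pixel_mac(pixels, w_folded), shift))
```

In this sketch the digital back end never sees a separate BN or ReLU layer: both are absorbed into the weights and the conversion step, which is the sense in which the operations are fused.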
Abstract
The massive amounts of data generated by camera sensors motivate data processing inside pixel arrays, i.e., at the extreme edge. Several critical developments have fueled recent interest in the processing-in-pixel-in-memory paradigm for a wide range of visual machine intelligence tasks, including (1) advances in 3D integration technology that enable complex processing inside each pixel while maintaining pixel density, (2) analog processing circuit techniques for massively parallel low-energy in-pixel computations, and (3) algorithmic techniques that mitigate non-idealities associated with analog processing through hardware-aware training schemes. This article presents a comprehensive technology-circuit-algorithm landscape that connects technology capabilities, circuit design strategies, and algorithmic optimizations to power, performance, area, bandwidth reduction, and application-level accuracy metrics. We present our results using a comprehensive co-design framework incorporating hardware and algorithmic optimizations for various complex real-life visual intelligence tasks mapped onto our P2M paradigm.
