A Reconfigurable Convolution-in-Pixel CMOS Image Sensor Architecture
Ruibing Song, Kejie Huang, Zongsheng Wang, Haibin Shen
TL;DR
The work tackles data movement and energy bottlenecks in CNN-based vision systems by moving the first-layer convolution into a reconfigurable Processing-in-Pixel CMOS image sensor. It introduces a PWM-driven MAC built from 2.5T-per-pixel units with in-array convlink wiring and kernel-splicing to realize $3\times3$ to $9\times9$ kernels, enabling parallel, in-sensor first-layer computation. Simulation results in a $128\times128$ array show high linearity ($R^2>0.98$), substantial readout reduction, and a computing efficiency up to $11.65$ TOPS/W for a $7\times7$ kernel at $60$ FPS, with energy per frame far lower than traditional CIS+DLA systems. The approach improves fill-factor, reduces power, and supports scalable kernel sizes for IoT and surveillance scenarios, where low-latency, low-power vision is critical.
Abstract
The separation of data capture and analysis in modern vision systems has led to massive data transfer between end devices and cloud computers, resulting in long latency, slow response, and high power consumption. Efficient hardware architectures are being actively developed to enable Artificial Intelligence (AI) on resource-limited end sensing devices. One of the most promising solutions is the Processing-in-Pixel (PIP) scheme. However, conventional PIP schemes suffer from a low fill factor. This paper proposes a PIP-based CMOS image sensor architecture that performs the convolution operation before the column readout circuit, significantly improving image readout speed at much lower power consumption. Simulation results show that the proposed architecture achieves a computing efficiency of up to 11.65 TOPS/W in the 8-bit weight configuration, three times that of conventional schemes. Each pixel requires only 2.5 transistors (2.5T), which significantly improves the fill factor.
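As a rough illustration of the PWM-driven multiply-accumulate idea mentioned above, the following Python sketch models one kernel window in idealized form. It is not the circuit described in the paper: the function names, the mapping of each weight magnitude to an integer pulse width, and the use of separate positive and negative accumulation paths for signed weights are all assumptions made for this example.

```python
def direct_mac(patch, kernel):
    """Exact multiply-accumulate over one kernel window, for reference."""
    return sum(p * w for prow, krow in zip(patch, kernel)
               for p, w in zip(prow, krow))


def pwm_mac(patch, kernel, weight_bits=8):
    """Idealized PWM-style MAC (hypothetical model, not the paper's circuit).

    Each weight magnitude is quantized to an integer pulse width in clock
    ticks. Each pixel contributes charge = current * ticks. Positive and
    negative weights accumulate separately (an assumption of this sketch)
    and are subtracted at the end.
    """
    max_ticks = (1 << weight_bits) - 1
    # Largest weight magnitude maps to the longest pulse; avoid dividing by 0.
    w_max = max(abs(w) for row in kernel for w in row) or 1.0
    pos_charge = 0.0
    neg_charge = 0.0
    for prow, krow in zip(patch, kernel):
        for current, w in zip(prow, krow):
            ticks = round(abs(w) / w_max * max_ticks)  # quantized pulse width
            charge = current * ticks                   # integrate over pulse
            if w >= 0:
                pos_charge += charge
            else:
                neg_charge += charge
    # Rescale accumulated charge back to the units of the original weights.
    return (pos_charge - neg_charge) * w_max / max_ticks


# Example window of pixel currents (arbitrary units) and a 3x3 kernel.
patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]

exact = direct_mac(patch, kernel)
approx = pwm_mac(patch, kernel)
print(exact, approx)
```

In this model the only error source is weight quantization, so the PWM result tracks the exact sum closely. A real sensor would add nonlinearity and noise, which is what the reported linearity figure ($R^2>0.98$) characterizes.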
