TeMPO: Efficient Time-Multiplexed Dynamic Photonic Tensor Core for Edge AI with Compact Slow-Light Electro-Optic Modulator
Meng Zhang, Dennis Yin, Nicholas Gangi, Amir Begović, Alexander Chen, Zhaoran Rena Huang, Jiaqi Gu
TL;DR
TeMPO addresses the energy and bandwidth challenges of edge AI by delivering a time-multiplexed dynamic photonic tensor core built on cross-layer device, circuit, and architecture customization. It combines a compact slow-light MZM input encoder, tailored optical splitters and phase shifters, hierarchical analog partial-product accumulation via capacitive temporal integration, and a multi-core tiled layout to maximize sharing and minimize ADC overhead. The approach achieves 368.6 TOPS, 22.3 TOPS/W, and 1.2 TOPS/mm$^2$ compute density on representative edge workloads with 6-bit quantization, while maintaining robustness to hardware noise. This work demonstrates a practical pathway to Pareto-frontier electronic-photonic accelerators for real-time, energy-efficient edge inference through deliberate cross-layer design.
Abstract
Electronic-photonic computing systems offer immense potential in energy-efficient artificial intelligence (AI) acceleration tasks due to the superior computing speed and efficiency of optics, especially for real-time, low-energy deep neural network (DNN) inference tasks on resource-restricted edge platforms. However, current optical neural accelerators based on foundry-available devices and conventional system architecture still encounter a performance gap compared to highly customized electronic counterparts. To bridge the performance gap due to lack of domain specialization, we present a time-multiplexed dynamic photonic tensor accelerator, dubbed TeMPO, with cross-layer device/circuit/architecture customization. At the device level, we present foundry-compatible, customized photonic devices, including a slow-light electro-optic modulator with experimental demonstration, optical splitters, and phase shifters that significantly reduce the footprint and power in input encoding and dot-product calculation. At the circuit level, partial products are hierarchically accumulated via parallel photocurrent aggregation, lightweight capacitive temporal integration, and sequential digital summation, considerably relieving the analog-to-digital conversion bottleneck. We also employ a multi-tile, multi-core architecture to maximize hardware sharing for higher efficiency. Across diverse edge AI workloads, TeMPO delivers digital-comparable task accuracy with superior quantization/noise tolerance. We achieve a 368.6 TOPS peak performance, 22.3 TOPS/W energy efficiency, and 1.2 TOPS/mm$^2$ compute density, pushing the Pareto frontier in edge AI hardware. This work signifies the power of cross-layer co-design and domain-specific customization, paving the way for future electronic-photonic accelerators with even greater performance and efficiency.
