M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core based on PRogrammable Multi-Operand Multimode Interference

Jiaqi Gu; Hanqing Zhu; Chenghao Feng; Zixuan Jiang; Ray T. Chen; David Z. Pan

TL;DR

M3ICRO introduces an ML-enabled coherent photonic tensor core (PTC) built from programmable multi-operand multimode interference (MOMMI) devices, overcoming the footprint and compute-density limits of conventional PTCs. By cascading programmable MOMMIs and applying a block unfolding scheme, the design achieves near-universal matrix expressivity while supporting efficient real-valued linear transforms. An ML training flow with a neural device predictor makes the device variables differentiable, so they can be optimized by gradient descent, dramatically accelerating design compared with full Maxwell simulations. Across multiple models and benchmarks, M3ICRO demonstrates substantially higher compute density, higher throughput, a smaller footprint, and robust performance relative to state-of-the-art coherent PTCs and conventional GPUs, highlighting the practical potential of device customization for scalable photonic ML acceleration.
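The differentiable flow can be illustrated with a toy example. In the sketch below, a hypothetical linear surrogate `predict` stands in for the trained neural device predictor: it maps real pad variables `eps` (index changes) to a complex transfer matrix, and gradient descent on `eps` pulls that matrix toward a target. The surrogate, its parameters (`W0`, `D`), the squared Frobenius loss, and the step size are illustrative assumptions, not the authors' model or training code. The point is only that once device behavior is predicted by a differentiable function, device variables can be optimized with gradients instead of repeated Maxwell simulations.

```python
import numpy as np


def make_surrogate(k: int, d: int, seed: int = 0):
    """Hypothetical linear stand-in for a neural device predictor.

    Maps d real pad variables eps to a complex k x k transfer matrix
    W(eps) = W0 + sum_j eps_j * D_j. A real predictor would be a trained
    neural network; this toy only needs to be differentiable in eps.
    """
    rng = np.random.default_rng(seed)
    W0 = rng.normal(size=(k, k)) + 1j * rng.normal(size=(k, k))
    D = rng.normal(size=(d, k, k)) + 1j * rng.normal(size=(d, k, k))
    return W0, D


def predict(eps: np.ndarray, W0: np.ndarray, D: np.ndarray) -> np.ndarray:
    # Weighted sum of the basis matrices D_j plus the fixed offset W0.
    return W0 + np.tensordot(eps, D, axes=1)


def loss_and_grad(eps, W0, D, target):
    R = predict(eps, W0, D) - target            # complex residual
    loss = float(np.sum(np.abs(R) ** 2))        # squared Frobenius norm
    # eps_j is real, so dL/deps_j = 2 * Re(sum(conj(R) * D_j)).
    grad = 2.0 * np.real(np.einsum("ab,jab->j", np.conj(R), D))
    return loss, grad


def optimize(target, W0, D, lr: float = 1e-3, steps: int = 2000):
    """Plain gradient descent on the pad variables, starting from zero."""
    eps = np.zeros(D.shape[0])
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(eps, W0, D, target)
        history.append(loss)
        eps = eps - lr * grad
    return eps, history
```

With a reachable target, e.g. `target = predict(eps_true, W0, D)`, the recovered `eps` should approach `eps_true`. With an actual neural predictor, an autodiff framework would supply the gradient in place of the hand-derived formula used here.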

Abstract

Photonic computing shows promise for transformative advancements in machine learning (ML) acceleration, offering ultra-fast speed, massive parallelism, and high energy efficiency. However, current photonic tensor core (PTC) designs based on standard optical components hinder scalability and compute density due to their large spatial footprint. To address this, we propose an ultra-compact PTC built from customized programmable multi-operand multimode interference (MOMMI) devices, named M3ICRO. The programmable MOMMI leverages the intrinsic light propagation principle, providing a single-device programmable matrix unit beyond the conventional computing paradigm of one multiply-accumulate (MAC) operation per device. Customized devices are hard to optimize because evaluating them often requires time-consuming simulation; to overcome this, we apply ML for optics to predict the device behavior and enable a differentiable optimization flow. We thoroughly investigate the reconfigurability and matrix expressivity of our customized PTC, and introduce a novel block unfolding method that fully exploits the computing capabilities of a complex-valued PTC for near-universal real-valued linear transformations. Extensive evaluations demonstrate that M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x higher compute density, 3.7-12x higher system throughput, and superior noise robustness compared to state-of-the-art coherent PTC designs, while maintaining close-to-digital task accuracy across various ML benchmarks. Our code is open-sourced at https://github.com/JeremieMelo/M3ICRO-MOMMI.
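The abstract does not spell out the block unfolding itself, so the following is only a sketch of the underlying arithmetic, assuming the unfolding builds on the standard relation between complex and real matrix products. For a real input x, a complex matrix W = A + iB produces Ax in the real part and Bx in the imaginary part of its output, so one complex k x k pass carries two real k x k products. For complex inputs, the standard real embedding [[A, -B], [B, A]] reproduces Wx exactly. Both function names are hypothetical, and neither is claimed to be the paper's exact scheme.

```python
import numpy as np


def stacked_real_matrix(W: np.ndarray) -> np.ndarray:
    """Real (2k x k) matrix [Re W; Im W] that W applies to a real input."""
    # For real x: W @ x = Re(W) @ x + 1j * Im(W) @ x, so reading out both
    # quadratures yields two independent real k x k products in one pass.
    return np.vstack([W.real, W.imag])


def real_embedding(W: np.ndarray) -> np.ndarray:
    """Standard (2k x 2k) real embedding [[A, -B], [B, A]] of W = A + iB."""
    # Acting on [Re z; Im z], it returns [Re(W z); Im(W z)] for complex z.
    A, B = W.real, W.imag
    return np.block([[A, -B], [B, A]])
```

The first form is what makes a complex-valued core attractive for real-valued workloads: no capacity is wasted on an always-zero imaginary input.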

Paper Structure (22 sections, 10 equations, 15 figures, 4 tables)

Introduction
Proposed MOMMI-based PTC M3ICRO
Initial State Design of General MMI Device
M3ICRO: Programmable MOMMI-based PTC
Efficient Complex Tensor Core via Block Unfolding
Machine Learning-Enabled Differentiable Optimization
Expressivity of Programmable MOMMI
Hardware Performance and Efficiency Analysis
Evaluation
Training Setups
Accuracy Evaluation
Quantization Tolerance Evaluation
Device Noise Robustness Evaluation
Ablation Study on Block Unfolding
Advance Compute Density vs. Efficiency Pareto Frontier
...and 7 more sections

Figures (15)

Figure 1: Overview of photonic tensor core designs with increasing compute density. PTCs with standard devices: (a) MZI array [NP_NATURE2017_Shen], (b) MRR weight bank [NP_SciRep2017_Tait], (c) butterfly-style PTC [NP_ASPDAC2020_Gu], and (d) PCM crossbar [NP_Nature2021_Feldmann]. (e) Our proposed M3ICRO PTC with customized MMI devices, trained with a machine learning-based approach.
Figure 2: (a) Real and (b) imaginary parts of the transfer matrix of the optimized $4\times 4$ MMI. (c) Detailed dimensions of the MMI. (d) The transfer matrix of the optimized MMI is close to a unitary matrix.
Figure 3: A $d$-op $k\times k$ programmable multi-operand MMI.
Figure 4: Visualization of the complex transfer matrix $W(\epsilon)\in\mathbb{C}^{4\times 4}$ of a 4-op $4\times 4$ MOMMI, projected into 2-D space using t-SNE. Each pad is discretized to 8 uniform levels (3-bit) and normalized to [0,1] by the maximum index change (0.03). Matrices are colored by $\sum_i\epsilon_i$.
Figure 5: The proposed MOMMI-based photonic tensor core M3ICRO with $P$ parallel paths and $C$ cascaded components.
...and 10 more figures
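Three operations mentioned in the captions above can be made concrete with short numerical helpers: checking how close a transfer matrix is to unitary (Figure 2(d)), discretizing pad index changes to 8 uniform levels normalized by 0.03 (Figure 4), and composing cascaded components into one transfer matrix (Figure 5). This is a hedged sketch: the helper names, the normalized Frobenius-norm metric, the rounding scheme, and the clipping to [0, 1] are assumptions. How the P parallel paths are combined is not described here, so only the cascade of components is modeled.

```python
import numpy as np


def unitarity_error(W: np.ndarray) -> float:
    """Frobenius norm of W^H W - I divided by ||I||_F; zero iff W is unitary."""
    k = W.shape[0]
    return float(np.linalg.norm(W.conj().T @ W - np.eye(k)) / np.sqrt(k))


def quantize_pads(delta_n, max_change: float = 0.03, bits: int = 3) -> np.ndarray:
    """Normalize index changes to [0, 1] and snap them to 2**bits uniform levels."""
    levels = 2 ** bits                                          # 8 levels for 3 bits
    x = np.clip(np.asarray(delta_n, dtype=float) / max_change, 0.0, 1.0)
    return np.round(x * (levels - 1)) / (levels - 1)          # values in {0, 1/7, ..., 1}


def cascade(mats) -> np.ndarray:
    """Overall transfer matrix of components traversed in order mats[0], mats[1], ..."""
    total = np.eye(mats[0].shape[0], dtype=complex)
    for W in mats:
        total = W @ total                                       # later components act on the left
    return total
```

For instance, a unitary `Q` obtained from `np.linalg.qr` of a random complex matrix gives an error near zero, and a cascade of unitary components is again unitary, which is one reason cascading preserves lossless behavior in this idealized model.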
