Spatially Varying Nanophotonic Neural Networks

Kaixuan Wei; Xiao Li; Johannes Froech; Praneeth Chakravarthula; James Whitehead; Ethan Tseng; Arka Majumdar; Felix Heide

TL;DR

This work tackles the accuracy gap between optical neural networks and modern digital models by embedding computation in the camera optics. It introduces a large-kernel spatially-varying (LKSV) convolution implemented with a meta-optical front-end of nanophotonic metalenses and a lightweight electronic backend, achieving a mostly optical computation regime (more than 99% of MACs performed optically) with a 4 mm front-end footprint. The LKSV kernel is learned via a low-dimensional reparameterization that factorizes a $15 \times 15$ kernel into seven $3 \times 3$ kernels and uses a spatially-varying basis, trained with regularizers to yield robust optical performance. Experimentally, the system reaches $72.76\%$ CIFAR-10 accuracy, outperforming AlexNet on CIFAR-10 with far fewer electronic parameters, and demonstrates transfer to ImageNet (top-5 $48.64\%$) and other vision tasks, validating the practicality of reconfigurable optical computing at the edge. Overall, this work shows that photonic front-ends can reach AlexNet-level accuracy at ultra-low power, enabling fast, compact, edge-friendly AI accelerators.
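The kernel factorization mentioned above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the seven $3 \times 3$ kernels are composed by plain full 2D convolution (each stage grows the support by 2 pixels, so $3 + 6 \cdot 2 = 15$), and the helper names `conv2d_full` and `compose_kernels` are hypothetical. Under this assumption the cascade uses $7 \cdot 9 = 63$ free parameters instead of the $225$ of a dense $15 \times 15$ kernel.

```python
import numpy as np

def conv2d_full(a, b):
    """Full 2D convolution: output size is a.shape + b.shape - 1."""
    ha, wa = a.shape
    hb, wb = b.shape
    out = np.zeros((ha + hb - 1, wa + wb - 1))
    # Each tap of b contributes a shifted, scaled copy of a.
    for i in range(hb):
        for j in range(wb):
            out[i:i + ha, j:j + wa] += b[i, j] * a
    return out

def compose_kernels(small_kernels):
    """Collapse a cascade of small kernels into one equivalent large kernel."""
    large = small_kernels[0]
    for k in small_kernels[1:]:
        large = conv2d_full(large, k)
    return large

# Illustrative random kernels (assumption: no constraints or regularizers).
rng = np.random.default_rng(0)
small_kernels = [rng.standard_normal((3, 3)) for _ in range(7)]
large_kernel = compose_kernels(small_kernels)

print(large_kernel.shape)                     # (15, 15)
print(sum(k.size for k in small_kernels))     # 63 parameters in the cascade
print(large_kernel.size)                      # 225 entries in the dense kernel
```

Because convolution is associative, filtering an image with the seven small kernels in sequence gives the same result as filtering once with `large_kernel`, which is why a single large kernel can be realized after training on the low-dimensional factors.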

Abstract

The explosive growth in the computation and energy cost of artificial intelligence has spurred strong interest in new computing modalities as potential alternatives to conventional electronic processors. Photonic processors, which execute operations using photons instead of electrons, promise to enable optical neural networks with ultra-low latency and power consumption. However, existing optical neural networks, limited by their underlying network designs, have achieved image recognition accuracy far below that of state-of-the-art electronic neural networks. In this work, we close this gap by embedding massively parallelized optical computation into flat camera optics that perform neural network computation during capture, before an image is recorded on the sensor. Specifically, we harness large kernels and propose a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques. We experimentally instantiate the network with a flat meta-optical system that encompasses an array of nanophotonic structures designed to induce angle-dependent responses. Combined with an extremely lightweight electronic backend with approximately 2K parameters, we demonstrate that a reconfigurable nanophotonic neural network reaches 72.76% blind test classification accuracy on the CIFAR-10 dataset. As such, for the first time, an optical neural network outperforms the first modern digital neural network, AlexNet (72.64%) with 57M parameters, bringing optical neural networks into the modern deep learning era.

Paper Structure (7 sections, 5 figures)

This paper contains 7 sections, 5 figures.

Large-Kernel Spatially-Varying Parameterization
Experimental Validation
Versatile Reconfigurable Computational Camera
Design and Optimization
Sample Fabrication
Experimental Setup
Data Availability

Figures (5)

Figure 1: Spatially varying nanophotonic neural networks. (a) Illustration of the proposed opto-electronic network, which comprises a nanophotonic array front-end that optically encodes the scene into multichannel image features and a lightweight electronic back-end that performs the final prediction, in a programmable manner, for image classification or semantic segmentation; (b) Each metalens is designed for specific learned large and angularly varying point spread functions that comprise the feature kernels of the early network layers, which vary over the sensor. These kernels are learned electronically using a spatially varying reparameterization. (c) We learn large kernels of size $15 \times 15$ (for digital $32 \times 32$ image classification) by factorizing them into a cascade of smaller ones. (d) Assessment of purely electronic AlexNet (Krizhevsky et al., 2012) compared to SVN$^3$: we report classification accuracies on CIFAR-10 and ImageNet datasets (top barplot), digital multiply–accumulate operations (MACs), and digital parameters (bottom barplot) for CIFAR-10 image recognition. The proposed method outperforms a network with multiple orders of magnitude more electronic parameters with multiple orders of magnitude fewer multiply–accumulate operations, see Table S2 for details. (e) Representative spatially-varying kernels plotted over space (see Figure S3 for high-resolution illustration) and the corresponding kernel standard deviation, illustrating the variation, at each spatial location (second row).
Figure 2: Experimental validation of SVN$^3$. (a) Flat camera prototype (left) and a metalens array device before mounting (right); (b) Illustration of the experimental setup, consisting of an OLED display placed at the designated object distance, metalens array, and CMOS sensor. Note that no additional optics are used. Camera and display are synchronized for data capture; (c) Spatially-varying PSF visualization on a $3\times 3$ sampling grid of incident angles. Here, we show four representative kernels; (d) Side-by-side comparison of the experimental measurements that match the corresponding ground truth feature channels. "Real-valued" denotes the target feature channel, the negative image feature subtracted from positive image features post-convolution.
Figure 3: Experimental measurements of a fabricated chip of a design for CIFAR-10 image classification. (a) Qualitative assessment of the experimental measurements compared with the ground truth feature channels. "Real-valued" again denotes the target feature channels via subtracting the negative from the positive image features post-convolution. (b) The confusion matrices of the experimental and simulation results on the CIFAR-10 test dataset validate the effectiveness of the method.
Figure 4: Experimental (top-2) classification (probability) results on random samples from the CIFAR-10 test set. Green and orange labels under the images denote correct and incorrect predictions, respectively. The method accurately predicts the correct class or a visually similar class. See Figures S15 and S16 for additional examples.
Figure 5: Validation of SVN$^3$ as a versatile camera for diverse vision tasks. (a) Experimentally measured feature maps of SVN$^3$ on the ImageNet dataset. (b) Recognition on ImageNet and other downstream datasets (CIFAR-100, Flowers-102, Food-101, and Pet-37) using the same optical front-end and a transfer-learned electronic decoder. (c) Transfer learning for semantic segmentation on the PASCAL VOC dataset. SVN$^3$ again achieves comparable or better performance than the AlexNet-based segmentation network (see Figure S17 for additional examples). These findings validate that the proposed camera, with a fixed optical encoder, can generalize to diverse tasks by adapting the electronic backend.
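The "Real-valued" feature channels in Figures 2(d) and 3(a) are obtained by subtracting a negative image feature from a positive one after convolution. A minimal sketch of why this recovers a signed convolution is given below. It assumes, for illustration only, that a signed kernel is split into its non-negative parts `k_pos = max(k, 0)` and `k_neg = max(-k, 0)` (an intensity point spread function cannot be negative); the helper `conv2d_same`, the random kernel, and the random image are hypothetical stand-ins, not the paper's learned kernels or data.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Same-size 2D convolution with zero padding (odd kernel sizes only)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # flip so this is convolution, not correlation
    h, w = image.shape
    out = np.zeros((h, w))
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + h, j:j + w]
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32))              # non-negative scene intensity
kernel = rng.standard_normal((15, 15))    # signed target kernel

# Assumed split into two non-negative kernels, each realizable as a PSF.
k_pos = np.maximum(kernel, 0.0)
k_neg = np.maximum(-kernel, 0.0)

# Two non-negative feature images, subtracted after convolution.
feature = conv2d_same(image, k_pos) - conv2d_same(image, k_neg)
reference = conv2d_same(image, kernel)

print(np.allclose(feature, reference))
```

Since convolution is linear and `k_pos - k_neg` equals the signed kernel, the difference of the two non-negative feature images matches the signed convolution up to floating-point error.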
