Compressed Meta-Optical Encoder for Image Classification
Anna Wirth-Singh, Jinlin Xiang, Minho Choi, Johannes E. Fröch, Luocheng Huang, Shane Colburn, Eli Shlizerman, Arka Majumdar
TL;DR
The paper presents a hybrid optical-electronic CNN that replaces most convolutional processing with a single optical convolution implemented via PSF-engineered meta-optics, while the electronic backend serves as a linear classifier. Knowledge distillation from a pretrained AlexNet-Mod teacher enables compressing the network to a single linear convolutional layer followed by two fully connected layers, circumventing the need for optical nonlinearities. Experimentally, a 16-kernel meta-optic front end coupled to a calibrated electronic backend achieves ~93–94% MNIST accuracy with ~86K MACs, compared with 17M MACs for the electronic AlexNet-Mod, a reduction of over two orders of magnitude in estimated latency and power at competitive accuracy. The approach is expected to scale well to high-resolution inputs, because the optical convolution takes effectively constant time regardless of input size and integrates with existing CNN architectures.
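As a rough illustration of what "optical convolution via PSF engineering" means, the sketch below simulates an incoherent imaging system in which each output channel is the input image convolved with one point spread function. This is a minimal numerical sketch, not the authors' design pipeline: the function name `optical_convolution`, the kernel size, and the use of FFT-based convolution are assumptions made for illustration; only the count of 16 kernels comes from the text.

```python
import numpy as np

def optical_convolution(image, psfs):
    """Simulate an optical front end: one output channel per PSF.

    image: 2D array of intensities, shape (H, W)
    psfs:  3D array of kernels, shape (K, h, w)
    Returns the full linear convolution, shape (K, H + h - 1, W + w - 1).
    """
    image = np.asarray(image, dtype=float)
    psfs = np.asarray(psfs, dtype=float)
    H, W = image.shape
    K, h, w = psfs.shape
    out_shape = (H + h - 1, W + w - 1)
    # Zero-padding to the full output size makes the FFT product
    # equal to a linear (not circular) convolution.
    image_f = np.fft.rfft2(image, s=out_shape)
    psfs_f = np.fft.rfft2(psfs, s=out_shape, axes=(-2, -1))
    return np.fft.irfft2(image_f[None] * psfs_f, s=out_shape, axes=(-2, -1))

# Hypothetical example: a 28x28 MNIST-sized input and 16 kernels of size 7x7.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
psfs = np.zeros((16, 7, 7))
psfs[:, 3, 3] = 1.0  # delta-function PSFs: each channel reproduces the input
features = optical_convolution(image, psfs)
```

In the hybrid system these feature maps would be produced by light propagating through the meta-optic, so they cost no electronic multiply-accumulate operations; only the backend that consumes them does.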
Abstract
Optical and hybrid convolutional neural networks (CNNs) have recently attracted increasing interest as a route to low-latency, low-power image classification and computer vision. However, implementing optical nonlinearity is challenging, and omitting the nonlinear layers in a standard CNN comes at the cost of a significant reduction in accuracy. In this work, we use knowledge distillation to compress a modified AlexNet to a single linear convolutional layer and an electronic backend (two fully connected layers). We obtain performance comparable to that of a purely electronic CNN with five convolutional layers and three fully connected layers. We implement the convolution optically by engineering the point spread function of an inverse-designed meta-optic. Using this hybrid approach, we estimate a reduction in multiply-accumulate operations from 17M in a conventional electronic modified AlexNet to only 86K in the hybrid compressed network enabled by the optical frontend. This constitutes a reduction of over two orders of magnitude in latency and power consumption. Furthermore, we experimentally demonstrate that the classification accuracy of the system exceeds 93% on the MNIST dataset.
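The "over two orders of magnitude" figure can be checked directly from the two MAC counts quoted in the abstract. The short sketch below does only that arithmetic; it assumes, as the abstract does, that latency and power scale with the number of electronic multiply-accumulate operations, and the variable names are illustrative.

```python
import math

macs_electronic = 17_000_000  # modified AlexNet, from the abstract
macs_hybrid = 86_000          # hybrid compressed network, from the abstract

reduction = macs_electronic / macs_hybrid
orders = math.log10(reduction)
print(f"{reduction:.0f}x fewer MACs, about {orders:.1f} orders of magnitude")
# prints "198x fewer MACs, about 2.3 orders of magnitude"
```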
