Applicability of Metalenses for Generalizable Computer Vision

Yubo Zhang, Johannes Fröch, Jinlin Xiang, Shane Colburn, Myunghoo Lee, Zhihao Zhou, Minho Choi, Eli Shlizerman, Arka Majumdar

TL;DR

This work investigates the generalizability of meta-optical encoders for computer vision by pairing a single-aperture metasurface with differentiable end-to-end optimization. It compares a broadband end-to-end-optimized metalens to a fixed hyperboloid baseline, showing that end-to-end design yields higher classification accuracy and a more balanced RGB frequency response, closely tied to the modulation transfer function (MTF). The study demonstrates that preserving in-band spatial-frequency content, as quantified by the MTF integral within the sensor cutoff, is a key interpretable factor driving ONN performance and robustness to sensor resolution. It proposes the MTF-preservation principle as a design guideline and outlines pathways for extending to multi-aperture, polarization, and joint sensor co-design for scalable, generalizable meta-optical encoders in computer vision.

Abstract

Optical neural networks (ONNs) are attracting increasing attention as a way to accelerate machine learning tasks. In particular, static meta-optical encoders designed for task-specific pre-processing have demonstrated orders-of-magnitude lower energy consumption than their purely digital counterparts, albeit at the cost of a slight degradation in classification accuracy. However, a lack of generalizability poses serious challenges for wide deployment of static meta-optical front-ends. Here, we investigate the utility of a metalens for generalized computer vision. Specifically, we show that a metalens optimized for full-color imaging can achieve image classification accuracy comparable to high-end, sensor-limited optics and consistently outperforms a hyperboloid metalens across a wide range of sensor pixel sizes. We further design an end-to-end single-aperture metasurface for ImageNet classification and find that the optimized metasurface tends to balance the modulation transfer function (MTF) for each wavelength. Together, these findings highlight that the preservation of spatial frequency-domain information is an essential interpretable factor underlying ONN performance. Our work provides both an interpretable understanding of task-driven optical optimization and practical guidance for designing high-performance ONNs and meta-optical encoders for generalizable computer vision.

Paper Structure

This paper contains 17 sections, 5 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: a. Schematic of the single-aperture meta-optic encoder. An image of an axolotl from ImageNet30 is displayed and encoded by the meta-optics, captured by a sensor, and passed to a downstream NN, which performs the classification. b. Schematic of the end-to-end design and PSF-engineering pipelines for a hybrid ONN system. Blue arrows indicate the differentiable forward propagation of information. The PSF-engineering-based ONN design follows the red arrows and involves two steps: first, retrieving the desired PSFs (kernels), which may come from a convolution layer or from purely optical information priors; second, inversely designing the scatterer distribution on the meta-optics to match those PSFs. In contrast, the end-to-end design (green arrows) jointly optimizes the optical encoder and the digital backend through a differentiable simulation pipeline, where sensor-plane signals are propagated to the digital network and the overall loss is minimized via gradient descent.
  • Figure 2: a. Schematic of a single scatterer in a 2D $\mathrm{Si_3N_4}$-on-quartz meta-optics array. b. RCWA simulation of the scatterer’s phase and amplitude responses at red (606 nm), green (511 nm), and blue (462 nm) wavelengths. c. SEM images of the end-to-end designed meta-optics. d. Microscope images of the end-to-end designed meta-optics (left) and the hyperboloid benchmark (right).
  • Figure 3: a. RGB PSFs of the end-to-end lens (top) and the hyperboloid lens (bottom). b. Log-scaled MTFs of the end-to-end metalens (left) and the hyperboloid metalens (right). c. (Left) Imaging results of an RGB pattern with the end-to-end (top) and hyperboloid (bottom) metalenses. (Right) ImageNet images captured on the sensor through the end-to-end and hyperboloid metalenses.
  • Figure 4: a. Increasing the sensor pixel size spatially averages the incident intensity, which reduces the resolvable PSF detail and consequently shifts the modulation transfer function (MTF) cutoff toward lower spatial frequencies. During end-to-end optimization with varying sensor pixel sizes, this sensor-imposed cutoff frequency defines the effective passband of the optical encoder, thereby determining which spectral regions of the MTF receive stronger optimization emphasis. b. Classification accuracy on ImageNet30 under different sensor binning levels, comparing (left) AlexNet and (right) EfficientNet digital back ends. Error bars denote accuracy variation across runs. Solid lines are visual aids for the trend and do not correspond to a theoretical prediction. Shaded regions visualize the accuracy spread and should not be interpreted as statistical confidence intervals.
  • Figure 5: a. Radial MTF curves before (dashed RGB lines) and after (solid RGB lines) end-to-end optimization using AlexNet and EfficientNet backends. Both x-axes are expressed in relative spatial-frequency and normalized-MTF units. b. Corresponding logarithmic-scale heatmaps of the MTFs shown in a.
  • ...and 5 more figures
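
The captions above, Figure 4a in particular, tie sensor pixel size to an MTF cutoff and to the spatial-frequency content that survives inside that passband. The following is a minimal NumPy sketch of that relationship, not the authors' simulation pipeline. It makes several assumptions that do not come from the paper: the PSFs are synthetic Gaussians, the "in-band MTF integral" is taken as the mean normalized MTF over a disk of radial frequencies below the sensor Nyquist limit, and all function names (`mtf_from_psf`, `bin_pixels`, `in_band_mtf`, `gaussian_psf`) are hypothetical.

```python
import numpy as np

def mtf_from_psf(psf):
    """Normalized MTF: magnitude of the PSF's Fourier transform, equal to 1 at zero frequency."""
    mtf = np.abs(np.fft.fft2(psf))
    return mtf / mtf[0, 0]

def bin_pixels(img, b):
    """Sum b x b blocks, emulating a sensor whose pixels are b times larger."""
    h, w = img.shape
    h2, w2 = h - h % b, w - w % b  # crop so both sides are divisible by b
    return img[:h2, :w2].reshape(h2 // b, b, w2 // b, b).sum(axis=(1, 3))

def in_band_mtf(psf, pitch, cutoff):
    """Mean normalized MTF over radial frequencies <= cutoff (assumed in-band metric)."""
    mtf = mtf_from_psf(psf)
    fy = np.fft.fftfreq(psf.shape[0], d=pitch)
    fx = np.fft.fftfreq(psf.shape[1], d=pitch)
    fr = np.hypot(fy[:, None], fx[None, :])  # radial spatial frequency grid
    return float(mtf[fr <= cutoff].mean())

def gaussian_psf(n, sigma):
    """Synthetic, unit-energy Gaussian PSF on an n x n grid (illustrative stand-in)."""
    y, x = np.mgrid[:n, :n] - n // 2
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return g / g.sum()

n, pitch = 64, 1.0  # grid size and sample spacing in arbitrary units
sharp, blurry = gaussian_psf(n, 1.0), gaussian_psf(n, 3.0)

results = {}
for b in (1, 2, 4):
    # A pixel b times larger samples at pitch b * pitch, so Nyquist drops by a factor of b.
    cutoff = 1.0 / (2.0 * b * pitch)
    results[b] = (in_band_mtf(sharp, pitch, cutoff), in_band_mtf(blurry, pitch, cutoff))
    print(f"binning {b}x: cutoff={cutoff:.3f}, "
          f"sharp={results[b][0]:.3f}, blurry={results[b][1]:.3f}")
```

In this toy setting the narrower PSF retains more in-band MTF at every binning level, which is the kind of comparison the in-band metric is meant to capture. `bin_pixels` can be applied to a PSF or a simulated image to emulate the coarser sensor directly, and real measured or simulated per-wavelength PSFs would replace the Gaussians.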