Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization

Ali Almuallem, Harshana Weligampola, Abhiram Gnanasambandam, Wei Xu, Dilshan Godaliyadda, Hamid R. Sheikh, Stanley H. Chan, Qi Guo

TL;DR

This work tackles the prohibitive cost of end-to-end optimization in opto-electronic CNNs by proposing a two-stage design: first train a conventional electronic CNN, then realize the optical front-end as a metasurface array by directly optimizing the kernels of the first convolutional layer. The Direct Kernel Optimization (DKO) approach reduces the design search space and training burden while maintaining accuracy, demonstrated on monocular depth estimation where the two-stage method outperforms end-to-end training under the same budget. Key contributions include formulating the optical front-end as a kernel-mimicking metasurface, applying a differentiable optical simulator for phase optimization, and validating the approach via comprehensive simulations on KITTI with Monodepth2. The results indicate substantial reductions in computation and parameter counts, with practical implications for scalable, fast, energy-efficient hybrid vision systems that exploit optical preprocessing for dense prediction tasks.

Abstract

Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN; then, realize the optical front-end, implemented as a metasurface array, through direct kernel optimization of the CNN's first convolutional layer. This approach reduces computational and memory demands by a factor of hundreds and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.
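As a rough illustration of the second stage, the sketch below fits a phase profile so that its point spread function (PSF) approximates a target kernel. It is not the authors' simulator. It assumes a simplified far-field model in which the PSF is the squared magnitude of the 2-D Fourier transform of a phase-only aperture, and it uses the absolute value of a small hypothetical kernel as a non-negative stand-in for a trained CNN kernel. All names (`make_target`, `psf_from_phase`, `loss_and_grad`, `optimize_phase`) are illustrative.

```python
import numpy as np

def make_target(kernel, size):
    """Embed |kernel| in a size x size grid and normalize it to unit sum."""
    k = np.abs(kernel)  # assumption: intensity PSFs are non-negative
    target = np.zeros((size, size))
    target[:k.shape[0], :k.shape[1]] = k
    return target / target.sum()

def psf_from_phase(phi):
    """Assumed far-field model: PSF = |FFT(exp(i*phi))|^2, normalized to sum 1."""
    field = np.exp(1j * phi)
    spectrum = np.fft.fft2(field)
    psf = np.abs(spectrum) ** 2 / phi.size ** 2  # Parseval: sum |F|^2 = size^2
    return psf, field, spectrum

def loss_and_grad(phi, target):
    """Squared error between PSF and target, with its analytic gradient in phi."""
    psf, field, spectrum = psf_from_phase(phi)
    resid = psf - target
    loss = np.sum(resid ** 2)
    # Back-propagate through |.|^2 and the FFT (adjoint of fft2 is size * ifft2).
    back = (2.0 / phi.size) * np.fft.ifft2(resid * spectrum)
    grad = -2.0 * np.imag(field * np.conj(back))
    return loss, grad

def optimize_phase(target, steps=200, seed=0):
    """Gradient descent with a simple backtracking step-size rule."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 2.0 * np.pi, target.shape)
    loss, grad = loss_and_grad(phi, target)
    history = [loss]
    step = 1.0
    for _ in range(steps):
        while step > 1e-12:
            cand = phi - step * grad
            cand_loss, cand_grad = loss_and_grad(cand, target)
            if cand_loss < loss:  # accept only steps that reduce the loss
                phi, loss, grad = cand, cand_loss, cand_grad
                step *= 2.0
                break
            step *= 0.5
        history.append(loss)
    return phi, history

if __name__ == "__main__":
    # Hypothetical 3x3 kernel standing in for one first-layer CNN kernel.
    kernel = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]])
    target = make_target(kernel, size=16)
    phi, history = optimize_phase(target)
    print("initial loss:", history[0], "final loss:", history[-1])
```

In this sketch the loop touches only the phase map and a fixed target kernel, so no training images or back-end passes are needed while the optics are being fitted.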

Paper Structure

This paper contains 7 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: We consider an opto-electronic convolutional neural network (CNN) that integrates a metasurface array with an electronic back-end. The metasurface, a flat nanophotonic device, encodes the incident light from a scene into optical feature maps. As light propagates through the metasurface, it undergoes a phase modulation whose effect is equivalent to convolving a conventional photograph of the scene with an engineered kernel. These optically generated feature maps are then processed electronically by a conventional CNN architecture.
  • Figure 2: Top row: Sample metasurface-learned kernels $h_{m,n}(u,v)$, and bottom row: corresponding kernels from the pretrained Monodepth2 model. Our optimized metasurfaces learn PSFs that closely match the original model's kernels.
  • Figure 3: Qualitative comparison on the KITTI dataset (simulation). The first column shows the input image; the second column shows the sparse ground-truth depth map; the third column shows the result from a simulated opto-electronic CNN, where the first convolutional layer is implemented by a metasurface and trained using the proposed two-stage strategy; the fourth and fifth columns show results from the same system trained end-to-end (E2E), initialized with and without the pretrained model, respectively. All design strategies use the same training time (12 h) and computational resources (one A100 GPU). The proposed two-stage strategy shows significantly better visual quality and accuracy than the E2E strategies. The inset numbers indicate the RMSE (in meters) for each prediction.
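The convolution picture in Figure 1 can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than the paper's pipeline: each metasurface in the array is modeled as a non-negative PSF, image formation is treated as a circular convolution computed with FFTs, and the function name `optical_feature_maps` and its argument shapes are hypothetical.

```python
import numpy as np

def optical_feature_maps(image, psfs):
    """Convolve one image with a stack of PSFs, one feature map per PSF.

    image: (H, W) array of scene intensities.
    psfs:  (M, k, k) array of PSFs, one per metasurface (assumed model).
    Returns an (M, H, W) array; boundaries wrap around (circular convolution).
    """
    H, W = image.shape
    image_f = np.fft.rfft2(image)
    maps = []
    for psf in psfs:
        k0, k1 = psf.shape
        padded = np.zeros((H, W))
        padded[:k0, :k1] = psf
        # Move the kernel centre to index (0, 0) so the output is not shifted.
        padded = np.roll(padded, (-(k0 // 2), -(k1 // 2)), axis=(0, 1))
        maps.append(np.fft.irfft2(image_f * np.fft.rfft2(padded), s=(H, W)))
    return np.stack(maps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 48))   # stand-in for a grayscale photograph
    psfs = rng.random((4, 3, 3))   # four hypothetical 3x3 PSFs
    features = optical_feature_maps(image, psfs)
    print(features.shape)
```

The stacked outputs play the role of the first convolutional layer's feature maps, which the electronic back-end would then consume.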