MOCLIP: A Foundation Model for Large-Scale Nanophotonic Inverse Design

S. Rodionov, A. Burguete-Lopez, M. Makarenko, Q. Wang, F. Getman, A. Fratalocchi

TL;DR

This work presents MOCLIP (Metasurface Optics Contrastive Learning Pretrained), a nanophotonic foundation model that integrates metasurface geometry and spectra within a shared latent space. The results position it as a scalable and versatile platform for next-generation photonic design and data-driven applications.

Abstract

Foundation models (FMs) are transforming artificial intelligence by enabling generalizable, data-efficient solutions across a broad range of domains and applications. However, the lack of large and diverse datasets limits the development of FMs in nanophotonics. This work presents MOCLIP (Metasurface Optics Contrastive Learning Pretrained), a nanophotonic foundation model that integrates metasurface geometry and spectra within a shared latent space. MOCLIP employs contrastive learning to align geometry and spectral representations using an experimentally acquired dataset with a sample density comparable to ImageNet-1K. The study demonstrates MOCLIP's inverse design capabilities for high-throughput zero-shot prediction at a rate of 0.2 million samples per second, enabling the design of a full 4-inch wafer populated with high-density metasurfaces in minutes. It also shows generative latent-space optimization reaching 97 percent accuracy. Finally, the study introduces an optical information storage concept that uses MOCLIP to achieve a density of 0.1 Gbit per square millimeter at the resolution limit, exceeding commercial optical media by a factor of six. These results position MOCLIP as a scalable and versatile platform for next-generation photonic design and data-driven applications.
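
The zero-shot inverse design mentioned in the abstract can be read as a similarity lookup in the shared latent space: a target spectrum is encoded once, and candidate geometries are ranked by how closely their latent vectors match it. The sketch below illustrates only that ranking step. It is not the authors' implementation: the random vectors stand in for encoder outputs, the function names `l2_normalize` and `zero_shot_top_k` are hypothetical, and the probe count of 1000 is an arbitrary choice. The 64-dimensional latent size follows the Figure 4 caption.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_top_k(target_latent, probe_latents, k=5):
    """Rank probe geometries by similarity to one target-spectrum latent.

    Returns the indices of the k best-matching probes and their scores,
    ordered from most to least similar.
    """
    target = l2_normalize(target_latent)
    probes = l2_normalize(probe_latents)
    scores = probes @ target              # shape: (n_probes,)
    top = np.argsort(-scores)[:k]         # descending order of similarity
    return top, scores[top]

# Placeholder data (assumption): random latents instead of real encoder outputs.
rng = np.random.default_rng(0)
probe_latents = rng.normal(size=(1000, 64))

# Pretend the target spectrum encodes close to the latent of probe 123.
true_index = 123
target_latent = probe_latents[true_index] + 0.05 * rng.normal(size=64)

top_idx, top_scores = zero_shot_top_k(target_latent, probe_latents, k=5)
print(top_idx, top_scores)
```

Because the ranking reduces to one matrix-vector product per target, it batches naturally over many targets, which is consistent with the high-throughput use described above. The throughput of this toy example says nothing about the reported figure.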

Paper Structure

This paper contains 12 sections, 2 equations, 8 figures.

Figures (8)

  • Figure 1: State-of-the-art nanophotonic and computer vision datasets. Dataset size versus degrees of freedom (DOFs), illustrating empirical scaling trends. Computer vision dataset sizes and DOFs are taken from Pope et al. (2021) [pope2021intrinsic]. The scaling law follows $N = k d^{\alpha}$, where $N$ is the dataset size and $d$ the number of DOFs, with fitted parameters $\alpha = 1.72$ and $k = 442$.
  • Figure 2: MOCLIP concept. a. Dataset of randomly generated free-form metasurface geometries. b. Experimental realization and characterization of the designs using an automated hyperspectral microscope. c. Dataset of measured spectra corresponding to the free-form metasurface designs. d. Encoding of experimental metasurface spectral responses. e. Encoding of metasurface geometries. f. Shared latent space for spectral and geometrical information. g. Zero-shot prediction for inverse design (left) and spectra prediction (right). h. Latent space optimization for inverse design (left) and spectra prediction (right). i. Optical information storage via physical implementation of latent space information.
  • Figure 3: Dataset generation. a. A set of fabricated samples, each carrying six metasurface arrays of amorphous silicon on a fused-silica substrate, totaling 136,530 dataset samples per substrate. b-d. SEM images of the fabricated metasurface arrays. e. Hyperspectral transmission microscopy setup. f. Spatial and spectral data hypercube of a metasurface array under fixed polarization conditions, shown in false colors for visualization. g. Expanded view of the metasurface array under broadband illumination. h. Example spectra extracted from the pixels of a single metasurface pad for x- and y-polarized illumination (green and blue sets of curves) and average pixel responses (black and orange curves).
  • Figure 4: MOCLIP training. a. Representation of geometrical metasurface information as $128\times 128$ binary images with period and thickness parameters. b. Geometry encoder, comprising a CNN architecture. c. 64-dimensional latent vectors produced by the geometry encoder. d. Spectral information represented by two 29-dimensional vectors for x- and y-polarizations. e. Spectra encoder based on an MLP architecture. f. 64-dimensional latent vectors produced by the spectra encoder. g. Similarity matrix containing pairwise dot products between latent vectors from both modalities. Yellow entries correspond to matching geometry–spectra pairs trained to have similarity values close to 1, with the rest indicating non-matching pairs trained to approach zero.
  • Figure 5: Zero-shot prediction. a. Target spectrum encoding into a latent vector. b. Encoding of probe geometries, including the ground truth one, into latent vectors. c. Similarity score vector between the target spectrum and the candidate geometry latent vectors. The yellow cell indicates the best matching score for the predicted geometry. d. Top-k accuracy for the test dataset. e. Statistical distribution of the MSE between the target spectrum and measured spectra of the predicted geometries as a function of the number of probe geometries. Shaded bands indicate quantile regions; the solid line indicates the mean.
  • ...and 3 more figures
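
The similarity matrix described in the Figure 4 caption (panel g) can be sketched numerically as follows. This is an illustration under stated assumptions, not the paper's training code: the latent vectors are random placeholders rather than CNN or MLP outputs, the batch size of 8 is arbitrary, and the loss shown is the symmetric cross-entropy popularized by CLIP with an assumed temperature of 0.07. The caption does not state which loss MOCLIP uses. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def similarity_matrix(geom_latents, spec_latents):
    """Pairwise dot products; entry (i, j) compares geometry i with spectrum j."""
    return l2_normalize(geom_latents) @ l2_normalize(spec_latents).T

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style loss (assumed here): matching pairs sit on the diagonal."""
    logits = sim / temperature
    diag = np.arange(sim.shape[0])
    loss_geom_to_spec = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_spec_to_geom = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_geom_to_spec + loss_spec_to_geom)

# Placeholder latents (assumption): 8 pairs of 64-dimensional vectors.
rng = np.random.default_rng(0)
batch, dim = 8, 64
geom = rng.normal(size=(batch, dim))
spec_aligned = geom + 0.05 * rng.normal(size=(batch, dim))  # well-aligned pairs
spec_random = rng.normal(size=(batch, dim))                 # unaligned pairs

sim_aligned = similarity_matrix(geom, spec_aligned)
loss_aligned = symmetric_contrastive_loss(sim_aligned)
loss_random = symmetric_contrastive_loss(similarity_matrix(geom, spec_random))
print(np.round(np.diag(sim_aligned), 3), loss_aligned, loss_random)
```

For well-aligned pairs the diagonal of the matrix approaches 1 while the off-diagonal entries stay much smaller, which is the pattern the caption describes. The loss is correspondingly lower than for unaligned latents.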