CViT: Continuous Vision Transformer for Operator Learning

Sifan Wang; Jacob H Seidman; Shyam Sankaran; Hanwen Wang; George J. Pappas; Paris Perdikaris

TL;DR

This work tackles operator learning for maps between infinite-dimensional function spaces by introducing CViT, a Continuous Vision Transformer that fuses a vision-transformer encoder with a trainable grid-based coordinate embedding in the decoder and a query-wise cross-attention mechanism for continuous evaluation. The approach yields resolution-invariant, multi-scale representations that can be queried at arbitrary coordinates, supported by Lipschitz/spectral-bias insights and extensive experiments on PDE benchmarks (advection, shallow-water, Navier–Stokes, diffusion-reaction). CViT achieves state-of-the-art or competitive results with fewer parameters and without extensive pretraining, highlighting strong parameter efficiency and robustness to discontinuities. By bridging conditioned neural fields with vision-transformer architectures, CViT offers a flexible, scalable framework for high-fidelity surrogates of complex physical dynamics, with significant implications for climate modeling and engineering design.
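The grid-based coordinate embedding mentioned above can be illustrated with a minimal sketch. It assumes a uniform latent grid on the unit square and a softmax weighting over negative squared distances, sharpened by a parameter beta. The names make_latent_grid and embed_queries, the default beta, and the random stand-in features are hypothetical; the exact interpolation rule used in the paper may differ.

```python
import numpy as np

def make_latent_grid(n_x, n_y, dim, seed=0):
    """Uniform grid coordinates in [0, 1]^2 plus random stand-ins for trainable features."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(0.0, 1.0, n_x)
    ys = np.linspace(0.0, 1.0, n_y)
    # coords has shape (n_x * n_y, 2); feats has shape (n_x * n_y, dim).
    coords = np.stack(np.meshgrid(xs, ys, indexing="ij"), axis=-1).reshape(-1, 2)
    feats = rng.normal(size=(n_x * n_y, dim))
    return coords, feats

def embed_queries(queries, coords, feats, beta=100.0):
    """Interpolate grid features at arbitrary query points: (n_q, 2) -> (n_q, dim).

    Assumed weighting: softmax of -beta * squared distance to each grid node.
    Large beta approaches nearest-node lookup; beta = 0 averages all nodes.
    """
    d2 = ((queries[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)  # (n_q, n_grid)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats

coords, feats = make_latent_grid(8, 8, dim=4)
queries = np.array([[0.5, 0.5], [0.13, 0.77]])  # arbitrary off-grid coordinates
emb = embed_queries(queries, coords, feats)
```

Because the embedding is a function of the coordinate alone, the same latent grid can serve any output resolution; only the list of query points changes.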

Abstract

Operator learning, which aims to approximate maps between infinite-dimensional function spaces, is an important area in scientific machine learning with applications across various physical domains. Here we introduce the Continuous Vision Transformer (CViT), a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems. CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. This design allows for flexible output representations and consistent evaluation at arbitrary resolutions. We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes. Our comprehensive experiments show that CViT achieves state-of-the-art performance on multiple benchmarks, often surpassing larger foundation models, even without extensive pretraining and roll-out fine-tuning. Taken together, CViT exhibits robust handling of discontinuous solutions, multi-scale features, and intricate spatio-temporal dynamics. Our contributions can be viewed as a significant step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in the physical sciences.
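The query-wise cross-attention idea can be sketched as follows: each query embedding attends to the encoder tokens on its own, so the prediction at one coordinate does not depend on which other coordinates are queried. This is a single-head illustration only, with no layer normalization, residual connections, or MLP blocks. The function names, random weights, and dimensions are assumptions for the example, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_emb, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from coordinates, keys/values from the encoder.

    query_emb: (n_q, d) embeddings of the query coordinates
    tokens:    (n_tok, d) encoder output tokens
    """
    q = query_emb @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_tok)
    return softmax(scores, axis=-1) @ v      # (n_q, d)

rng = np.random.default_rng(0)
d = 16
tokens = rng.normal(size=(64, d))  # stand-in for encoder output
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

coarse = rng.normal(size=(10, d))                           # 10 query embeddings
fine = np.concatenate([coarse, rng.normal(size=(30, d))])   # the same 10 plus 30 more

out_coarse = cross_attention(coarse, tokens, w_q, w_k, w_v)
out_fine = cross_attention(fine, tokens, w_q, w_k, w_v)
```

In this sketch, the first 10 rows of out_fine agree with out_coarse up to floating-point rounding, which is the sense in which evaluation is consistent across query resolutions.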

Paper Structure (82 sections, 3 theorems, 51 equations, 19 figures, 10 tables)

Introduction
Background and Related Work
Background.
Transformer-Based Operator Learning.
Open Challenges.
Continuous Vision Transformer (CViT)
Architecture Description
Patch Embeddings.
Temporal Aggregation.
Transformer Encoder.
Cross-Attention Decoder.
Theoretical Insights.
CViT vs other Transformer-based Approaches.
Experiments
CViT model setup.
...and 67 more sections

Key Result

Proposition D.1

Suppose that a single layer of the Fourier Neural Operator (FNO) (Li et al., 2021) is given by [equation missing]. Then the integral kernel term can be expressed as [equation missing].

Figures (19)

Figure 1: Continuous Vision Transformer (CViT) architecture. CViT consists of the following components: (1) Spatio-temporal patch embeddings that extract localized features. (2) A temporal aggregation module based on the Perceiver architecture, which captures temporal correlations and compresses tokens along the time axis. (3) A Transformer encoder that captures multi-scale spatial dependencies via self-attention layers. (4) A novel grid-based positional encoding scheme for query coordinates, allowing for flexible output representation and interpolation. (5) A cross-attention decoder that integrates information from the input function with query coordinates.
Figure 2: Advection of discontinuous waveforms. Prediction (red dashed line) versus ground truth (blue line) for the worst-case example in the test dataset, for: DeepONet; NoMaD; FNO; CViT. Also reported is the associated Total Variation (TV) error ($\downarrow$).
Figure 3: Shallow water benchmark. Representative CViT rollout prediction of the vorticity field, and point-wise error against the ground truth.
Figure 4: Ablation studies for CViT on the shallow-water equations benchmark. Convergence of test errors for: (a) different patch sizes; (b) different coordinate embeddings; (c) different resolutions of latent grids; (d) sensitivity to the interpolation of the CViT latent grid features, as controlled by the $\beta$ parameter. Results obtained using CViT-L with a $16 \times 16$ patch size, varying each hyper-parameter of interest while keeping the others fixed.
Figure 5: Incompressible Navier-Stokes benchmark (NS). Representative CViT rollout predictions of the passive scalar field, and point-wise error against the ground truth.
...and 14 more figures

Theorems & Definitions (6)

Proposition D.1
proof
Definition E.1
Theorem E.2
Theorem E.3
proof
