Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows

Runze Mao, Rui Zhang, Xuan Bai, Tianhao Wu, Teng Zhang, Zhenyi Chen, Minqi Lin, Bocheng Zeng, Yangchen Xu, Yingxuan Xiang, Haoze Zhang, Shubham Goswami, Pierre A. Dawe, Yifan Xu, Zhenhua An, Mengtao Yan, Xiaoyi Lu, Yi Wang, Rongbo Bai, Haobu Gao, Xiaohang Fang, Han Li, Hao Sun, Zhi X. Chen

TL;DR

REALM introduces a rigorous benchmark for evaluating neural surrogates on realistic multiphysics flows governed by PDE-ODE couplings, using 11 high-fidelity datasets across canonical and industrial scenarios. The authors provide an end-to-end framework with standardized preprocessing, rollout training, and capacity-aligned model presets, enabling fair cross-architecture comparisons among spectral, Transformer, CNN, and graph-based surrogates. Across 2D/3D regular and irregular meshes and stiff chemistry, they observe (i) a scaling barrier tied to dimensionality, stiffness, and mesh irregularity; (ii) performance that depends more on architectural inductive biases than on parameter count; and (iii) a persistent gap between nominal metrics and physically faithful long-horizon behavior. The study highlights the need for physics-aware architectures and for evaluation criteria focused on conservation and long-horizon fidelity, and offers REALM as a benchmark to drive the development of robust surrogates.

Abstract

Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an "illusion of mastery", as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models' inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.
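
The rollout evaluation and trend (iii) mentioned in the abstract can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the `reference_step` and `surrogate_step` functions are hypothetical toy stand-ins (not REALM models or solvers), and relative L2 error and Pearson correlation are used as generic examples of the kind of nominal metrics the figure captions refer to. The sketch only shows how a small one-step bias compounds under autoregressive rollout.

```python
import math


def rollout(step, state0, n_steps):
    """Apply `step` autoregressively: each prediction becomes the next input."""
    states, state = [], list(state0)
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states


def relative_l2(pred, ref):
    """Relative L2 error ||pred - ref||_2 / ||ref||_2 for one snapshot."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den


def pearson(pred, ref):
    """Pearson correlation coefficient between two flattened fields."""
    mp, mr = sum(pred) / len(pred), sum(ref) / len(ref)
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_r = sum((r - mr) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


# Toy stand-ins (hypothetical): the "solver" decays every cell by 0.90 per
# step, the "surrogate" by 0.91, i.e. a small systematic one-step bias.
def reference_step(state):
    return [0.90 * v for v in state]


def surrogate_step(state):
    return [0.91 * v for v in state]


state0 = [1.0, 2.0, 3.0, 4.0]
ref_traj = rollout(reference_step, state0, n_steps=20)
pred_traj = rollout(surrogate_step, state0, n_steps=20)

# Per-step metrics along the rollout, plus a simple integral quantity (the sum).
errors = [relative_l2(p, r) for p, r in zip(pred_traj, ref_traj)]
corrs = [pearson(p, r) for p, r in zip(pred_traj, ref_traj)]
totals = [(sum(p), sum(r)) for p, r in zip(pred_traj, ref_traj)]

for k in (0, 9, 19):
    print(f"step {k + 1:2d}: rel-L2 = {errors[k]:.3f}, corr = {corrs[k]:.3f}")
```

In this toy setting the correlation stays essentially at 1 because the predicted field keeps the right spatial pattern, while the relative error and the summed quantity drift further from the reference at every step. This is only a qualitative analogue of the gap between nominal accuracy and physically trustworthy behavior described above, not a reproduction of any REALM result.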

Paper Structure

This paper contains 10 sections, 19 equations, 20 figures, 20 tables.

Figures (20)

  • Figure 1: Overview of the REALM benchmark and problem setting. a, Multiphysics reactive flow. Illustration of a jet-flame configuration where large-scale coherent motions interact with small-scale eddies; at the flame front, scalar diffusion regularizes steep gradients. b, Coupled PDE--ODE dynamics. The reactive flow follows the compressible Navier--Stokes system with diffusion and chemistry (as sketched in the panel). Here, $\mathbf{q}$ denotes the conserved state vector; $\mathcal{F}(\mathbf{q})$ are the convective fluxes; $\mathcal{D}(\mathbf{q},\nabla\mathbf{q})$ collects diffusive contributions; and $\mathcal{S}(\mathbf{q})$ is the source obtained from the chemical ODE system. The panel highlights severe scale separation (chemistry $10^{-12}$--$10^{-9}$ s vs. flow $\mathcal{O}(10^{-1})$ s), the coexistence of fast and slow pathways, turbulent mixing across wavenumbers, and the contrasting signatures of convection and diffusion in concentration profiles. c, Dataset examples used in REALM, spanning canonical problems, high-Mach reactive flows, propulsion-engine scenarios, and fire-hazard cases. d, REALM training and evaluation protocol: multi-scale preprocessing and training with autoregressive rollout. Inputs/outputs are shown for multiple operating conditions $\{C_1,C_2,\ldots\}$ with $N_p$ fields per state. e, REALM multi-scale preprocessing: species mass fractions undergo a Box-Cox-type transform $\mathcal{F}_{\mathrm{BCT}}$ to compress their dynamic range from $\mathcal{O}(10^{-k})$ to $\mathcal{O}(1)$, followed by $z$-score normalization for all variables. f, Surrogate model families supported in REALM: the framework is model-agnostic and applies the same protocol across operator families, including spectral operators, convolutional backbones, transformer-style models, pointwise models, and mesh/graph or point-cloud models.
  • Figure 1: 2D regular cases: quantitative errors and visual comparisons. a, IgnitHIT. Left: snapshots of OH mass fraction $Y_{\mathrm{OH}}$ for the reference and the remaining surrogates; arrows indicate increasing time. Right: temporal evolution of the averaged correlation coefficients between the predicted fields and the reference. b, EvolveJet. Left: snapshots of $Y_{\mathrm{OH}}$ for the reference and the remaining surrogates. Right: temporal evolution of the averaged correlation coefficients between the predicted fields and the reference. c, PlanarDet. Maximum pressure fields $p_{\max}$ at representative times ($t_{30}$, $t_{50}$) for the reference and the surrogates, together with error maps at $t_{50}$ showing $|p_{\max}^{\mathrm{gt}}(t_{50}) - p_{\max}^{\mathrm{pred}}(t_{50})|$. d, PlanarDet. Left: temporal evolution of the averaged correlation coefficients between the predicted fields and the reference. Middle: temporal evolution of the mean detonation cell size for the reference and the surrogates. Right: temporal evolution of the detonation front location for the reference and the surrogates.
  • Figure 2: Dataset statistics and packaging. a, Taxonomy of all cases in the suite, grouped by scenario [Canonical Problems (CP), High-Mach Reacting Flows (HF), Propulsion Engines (PE), and Fire Hazards (FH)], by mesh type (regular vs. irregular), and by dimensionality (2D/3D). b, Global scale of the suite: per-case bars show the number of trajectories ($N_{\mathrm{traj}}$), the number of physical variables, and the total data volume. c, Dynamic range of key species in the EvolveJet case, shown as violin plots of mass fraction at three representative times. Dots and red bars mark the mean and mean $\pm$ standard deviation. d, Cell-quality landscape for a representative irregular case (MultiCoaxFlame). A 2D histogram over cell non-orthogonality and cell volume shows hybrid topology with regular and non-orthogonal regions, illustrating geometric heterogeneity faced by irregular-mesh surrogates. e, Spatial variability in a 2D detonation example. Left: spatial distribution of the peak pressure $p_{\max}$ at a selected time. Right: spatial profiles along four horizontal and vertical sample lines at the same time, revealing intermittent peaks and strongly nonstationary structure across the grid. f, Data organization. Each case contains multiple trajectories; each trajectory consists of a sequence of time steps; each time step stores a stack of physical variables. Thumbnails on the right depict typical fields for reference.
  • Figure 2: 3D regular cases: quantitative errors and visual comparisons. a, ReactTGV. Left: temporal evolution of the averaged cross-correlation coefficients and, at the final time, a comparison of the turbulent energy spectra between the predicted fields and the reference. Right: vorticity isosurfaces colored by velocity magnitude $|u|$ for the reference and for the predictions given by CROP; arrows indicate increasing time. b, PoolFire. Temporal evolution of the averaged cross-correlation coefficients between the predicted fields and the reference. c, PoolFire. Temperature isosurfaces at three representative times ($t_{1}, t_{20}, t_{40}$), colored by oxygen mass fraction $Y_{\mathrm{O_2}}$, for the reference and surrogates. d, PropHIT. Left: temporal evolution of the averaged cross-correlation coefficients between the predicted fields and the reference. Right: vorticity isosurfaces colored by velocity magnitude $|u|$ for the reference and surrogates.
  • Figure 3: 2D regular cases: quantitative errors and visual comparisons. a, IgnitHIT. Left: rollout error evolution (relative $\ell_2$) over time. Right: snapshots of OH mass fraction $Y_{\mathrm{OH}}$ and streamwise velocity $u$ for the reference and representative surrogates; arrows indicate increasing time. b, EvolveJet. Left: rollout error evolution (relative $\ell_2$). Right: snapshots of $Y_{\mathrm{OH}}$ and $u$ for the reference and representative surrogates. c, PlanarDet. Left: rollout error evolution (relative $\ell_2$). Right: maximum pressure fields $p_{\max}$ at representative times ($t_{30}$, $t_{50}$) for the reference and representative surrogates, together with error maps at $t_{50}$ showing $|p_{\max}^{\mathrm{gt}}(t_{50}) - p_{\max}^{\mathrm{pred}}(t_{50})|$.
  • ...and 15 more figures
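
The multi-scale preprocessing described in the caption of the overview figure (a Box-Cox-type transform of species mass fractions followed by $z$-score normalization) can be sketched as follows. This is a minimal illustration using the textbook Box-Cox form, $(y^{\lambda} - 1)/\lambda$; the exact definition of $\mathcal{F}_{\mathrm{BCT}}$ and its parameters are not given in this listing, so the function names and the value `lam=0.1` are assumptions chosen for the example, as are the sample mass fractions.

```python
import math
from statistics import fmean, pstdev


def box_cox(y, lam=0.1):
    """Textbook Box-Cox transform; lam=0.1 is an illustrative assumption."""
    if y < 0:
        raise ValueError("mass fractions must be non-negative")
    if lam == 0:
        return math.log(y)  # limiting case of (y**lam - 1) / lam as lam -> 0
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam=0.1):
    """Map a transformed value back to a mass fraction."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


# Hypothetical mass fractions spanning eleven orders of magnitude.
mass_fractions = [1e-12, 1e-9, 1e-6, 1e-3, 1e-1]

transformed = [box_cox(y) for y in mass_fractions]
normalized, mu, sigma = zscore(transformed)

# Undo both steps to recover physical values from normalized ones.
recovered = [inverse_box_cox(z * sigma + mu) for z in normalized]

for y, t, n in zip(mass_fractions, transformed, normalized):
    print(f"Y = {y:.0e} -> Box-Cox = {t:7.3f} -> z-score = {n:6.3f}")
```

The point of the power transform is that trace species, which would otherwise be numerically invisible next to major species, end up on a scale comparable to the other variables before standardization. Keeping `mu` and `sigma` alongside the transform parameter is what makes the mapping invertible when predictions are converted back to mass fractions.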