Generalization vs. Memorization in Autoregressive Deep Learning: Or, Examining Temporal Decay of Gradient Coherence

James Amarel, Nicolas Hengartner, Robyn Miller, Kamaljeet Singh, Siddharth Mansingh, Arvind Mohan, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde

TL;DR

The study addresses genuine generalization versus memorization in autoregressive PDE surrogates, showing that long-horizon reliability cannot be inferred from one-step accuracy alone. It introduces a time-aware influence-function framework built on a proximal objective with a neural tangent kernel (NTK) metric $\eta$ to quantify how training signals propagate across time and across different initial-condition classes, formalized through a cross-sample influence matrix $\Pi$ and Lie-derivative diagnostics. Empirical results on compressible Euler (CE) data and NS variants, using UNet and ViT architectures, reveal strong time- and class-localization of gradient updates, rapid decay of off-diagonal influence, near-diagonal class-to-class transferability, and an anisotropic NTK spectrum with a few dominant modes. The findings underscore the need for physics-informed regularization or architecture design to learn coherent dynamical operators, and they position influence-function diagnostics, drawing on $t$-translation properties, the influence matrix $\Pi$, and NTK insights, as a practical tool for robust validation in high-stakes scientific applications.
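
To make the idea of gradient coherence concrete, the following is a minimal sketch and not the paper's proximal-objective framework. It assumes a toy linear one-step surrogate with a squared-error loss and scores each pair of samples by the cosine similarity of their loss gradients, a first-order stand-in for a cross-sample influence matrix such as $\Pi$. The names `loss_gradient` and `gradient_coherence`, the linear model `W`, and the random data are all illustrative assumptions.

```python
import numpy as np

def loss_gradient(W, u_in, u_target):
    """Gradient of 0.5 * ||W @ u_in - u_target||^2 with respect to W, flattened."""
    residual = W @ u_in - u_target
    return np.outer(residual, u_in).ravel()

def gradient_coherence(W, inputs, targets):
    """Cosine similarity between per-sample loss gradients (assumed proxy for influence)."""
    grads = np.stack([loss_gradient(W, x, y) for x, y in zip(inputs, targets)])
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)  # guard against zero gradients
    return unit @ unit.T

# Hypothetical toy data: 6 samples of a 4-dimensional state.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
inputs = rng.normal(size=(6, 4))
targets = rng.normal(size=(6, 4))
coherence = gradient_coherence(W, inputs, targets)
```

Under this proxy, off-diagonal entries close to zero mean that an update computed on one sample barely moves the loss on another, which is the kind of localization the summary describes.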

Abstract

Foundation models trained as autoregressive PDE surrogates hold significant promise for accelerating scientific discovery through their capacity to both extrapolate beyond training regimes and efficiently adapt to downstream tasks despite a paucity of examples for fine-tuning. However, reliably achieving genuine generalization - a necessary capability for producing novel scientific insights and robustly performing during deployment - remains a critical challenge. Establishing whether or not these requirements are met demands evaluation metrics capable of clearly distinguishing genuine model generalization from mere memorization. We apply the influence function formalism to systematically characterize how autoregressive PDE surrogates assimilate and propagate information derived from diverse physical scenarios, revealing fundamental limitations of standard models and training routines in addition to providing actionable insights regarding the design of improved surrogates.

Paper Structure

This paper contains 16 sections, 17 equations, 38 figures.

Figures (38)

  • Figure 2: Heatmap of two-time influence for our ViTs trained on CE data, shown as a function of perturbation time (horizontal axis) and response time (vertical axis). Each pixel reports the intra-class averaged response induced by test example gradients at the perturbation time. A narrow diagonal ridge corresponds to time-local sensitivity consistent with interpolation, rather than generalization; substantial off-diagonal structure would indicate time-transferable learning. For the analogous plot using our UNets, see \ref{fig:heatmap_UC}. For NS data counterparts, see \ref{fig:heatmap_VN} and \ref{fig:heatmap_UN}.
  • Figure 3: Class-to-class transferability matrix for our ViTs trained on the three-class CE split labeled RP, CRP, and RPUI. Each entry reports the time-averaged influence of test examples from the input class (horizontal axis) on examples from the response class (vertical axis). Diagonal dominance indicates class-locked gradient geometry; substantial off-diagonal values would imply reuse of dynamical features across classes. For the analogous plot using our UNets, see \ref{fig:diag_UC}. For NS data counterparts, see \ref{fig:diag_VN} and \ref{fig:diag_UN}.
  • Figure 4: Time-lag summary of temporal transferability for our ViTs on CE data. The influence is averaged over all time pairs with the same time difference and then split into intra-class pairs (gradient and response drawn from the same initial-condition class) and inter-class pairs (distinct initial-condition classes). Strong concentration near zero time difference indicates that gradient information fails to propagate coherently across time, while the lack of inter-class influence reveals an absence of physics-consistent generalization. For the analogous plot using our UNets, see \ref{fig:horizon_UC}. For NS data counterparts, see \ref{fig:horizon_VN} and \ref{fig:horizon_UN}.
  • Figure 5: Curve fit of the influence as a function of feature-space separation between input states for CE data, comparing UNet and ViT; range bars show uncertainty across seeds. A steep decay indicates short-range locality on the learned data manifold, implying that parameter updates affect only nearby states and generalization is limited. For the NS data counterpart, see \ref{fig:rkhs_NS}.
  • Figure 6: Two-time influence map for our ViTs on CE data when the response observable is the global mass-consistency signal. The horizontal axis indexes the time at which a test perturbation is applied, and the vertical axis indexes the time at which the mass-based response is evaluated. Off-diagonal support indicates that intra-class mass-related gradient information couples distant times. For the analogous plot using our UNets, see \ref{fig:heatmap_UC_mass}. For plots concerning energy conservation, see \ref{fig:heatmap_VC_energy} and \ref{fig:heatmap_UC_energy}.
  • ...and 33 more figures
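
The averaging described in the captions for Figures 3 and 4 can be sketched as follows. This is a hedged illustration on synthetic arrays, not the authors' code: the function names, the convention that rows index the response and columns index the input, and the exponentially decaying toy influence map are assumptions made for the example. The intra-class versus inter-class split used in Figure 4 is omitted for brevity.

```python
import numpy as np

def lag_profile(two_time):
    """Average a (T, T) two-time influence map over all pairs sharing the same |time difference|."""
    T = two_time.shape[0]
    idx = np.arange(T)
    lags = np.abs(idx[:, None] - idx[None, :])
    return np.array([two_time[lags == k].mean() for k in range(T)])

def class_transfer(influence, labels, n_classes):
    """Block-average an (N, N) sample influence matrix by class.

    Rows index the response class and columns index the input class (assumed convention).
    """
    out = np.zeros((n_classes, n_classes))
    for r in range(n_classes):
        for c in range(n_classes):
            out[r, c] = influence[np.ix_(labels == r, labels == c)].mean()
    return out

# Synthetic two-time map that decays with time separation (assumed decay scale of 2 steps).
T = 8
t = np.arange(T)
two_time = np.exp(-np.abs(t[:, None] - t[None, :]) / 2.0)
profile = lag_profile(two_time)

# Synthetic class-locked influence: samples only influence others in the same class.
labels = np.array([0, 0, 1, 1, 2, 2])
influence = (labels[:, None] == labels[None, :]).astype(float)
transfer = class_transfer(influence, labels, n_classes=3)
```

In this toy setting `profile` falls off monotonically with lag and `transfer` is the identity matrix, mirroring the diagonal dominance that the captions associate with time-local and class-locked learning.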