Generalization vs. Memorization in Autoregressive Deep Learning: Or, Examining Temporal Decay of Gradient Coherence
James Amarel, Nicolas Hengartner, Robyn Miller, Kamaljeet Singh, Siddharth Mansingh, Arvind Mohan, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde
TL;DR
The study addresses the problem of distinguishing genuine generalization from memorization in autoregressive PDE surrogates, showing that long-horizon reliability cannot be inferred from one-step accuracy alone. It introduces a time-aware influence-function framework, built on a proximal objective with a neural tangent kernel (NTK) metric $\eta$, to quantify how training signals propagate across time and across different initial-condition classes; the framework is formalized through a cross-sample influence matrix $\Pi$ and Lie-derivative diagnostics. Empirical results on compressible Euler (and NS variants) data using UNet and ViT architectures reveal strong time- and class-localization of gradient updates, rapid decay of off-diagonal influence, near-diagonal class-to-class transferability, and an anisotropic NTK spectrum with a few dominant modes. The findings underscore the need for physics-informed regularization or architecture design to learn coherent dynamical operators, and position influence-function diagnostics ($t$-translation properties, $\Pi$, and NTK insights) as a practical tool for robust validation in high-stakes scientific applications.
Abstract
Foundation models trained as autoregressive PDE surrogates hold significant promise for accelerating scientific discovery through their capacity both to extrapolate beyond training regimes and to adapt efficiently to downstream tasks despite a paucity of examples for fine-tuning. However, reliably achieving genuine generalization, a necessary capability for producing novel scientific insights and performing robustly during deployment, remains a critical challenge. Establishing whether these requirements are met demands evaluation metrics capable of clearly distinguishing genuine model generalization from mere memorization. We apply the influence function formalism to systematically characterize how autoregressive PDE surrogates assimilate and propagate information derived from diverse physical scenarios, revealing fundamental limitations of standard models and training routines and providing actionable insights for the design of improved surrogates.
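To make the notion of cross-sample influence concrete, the following is a minimal sketch rather than the formalism used in this work. It assumes a toy linear one-step surrogate $u_{t+1} = W u_t$ with a squared-error loss, and it uses the Gram matrix of per-sample gradients (an empirical NTK-style quantity) together with its cosine-normalized form as a simple stand-in for gradient coherence between training samples. The function names, the toy samples, and the normalization are illustrative assumptions and do not reproduce the definitions of $\Pi$ or $\eta$.

```python
import math

def per_sample_gradient(W, x, y):
    """Gradient of 0.5 * ||W x - y||^2 with respect to W, flattened row-major.

    Assumption: a linear one-step surrogate; real surrogates (UNet, ViT)
    would need automatic differentiation instead of this closed form.
    """
    residual = [sum(W[i][k] * x[k] for k in range(len(x))) - y[i]
                for i in range(len(W))]
    return [residual[i] * x[k] for i in range(len(W)) for k in range(len(x))]

def gradient_coherence(W, samples):
    """Return (gram, coherence) for a list of (input, target) pairs.

    gram[i][j] is the inner product of the gradients of samples i and j;
    coherence[i][j] is the same quantity normalized to a cosine similarity.
    """
    grads = [per_sample_gradient(W, x, y) for x, y in samples]
    gram = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]
    n = len(grads)
    coherence = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(gram[i][i] * gram[j][j])
            coherence[i][j] = gram[i][j] / denom if denom > 0.0 else 0.0
    return gram, coherence

# Hypothetical toy data: samples 0 and 2 excite the same state direction,
# sample 1 excites an orthogonal one.
W0 = [[0.0, 0.0], [0.0, 0.0]]
toy_samples = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([2.0, 0.0], [2.0, 0.0]),
]
gram, coherence = gradient_coherence(W0, toy_samples)
for row in coherence:
    print([round(value, 3) for value in row])
```

In this toy setting, samples 0 and 2 have perfectly aligned gradients while sample 1 is orthogonal to both, so a gradient step on one group leaves the loss on the other unchanged to first order. That is a deliberately simplified analogue of the near-diagonal, class-localized influence structure reported in the summary.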
