Learning in PINNs: Phase transition, total diffusion, and generalization

Sokratis J. Anagnostopoulos, Juan Diego Toscano, Nikolaos Stergiopulos, George Em Karniadakis

TL;DR

This work analyzes learning dynamics in physics-informed neural networks (PINNs) through gradient signal-to-noise ratio (SNR) under Adam optimization, situating the dynamics within information bottleneck theory. It identifies a novel third phase, total diffusion, where batch gradients become homogeneous and convergence accelerates, and shows that this phase aligns with improved generalization when residuals are diffused uniformly. A residual-based attention (RBA) scheme is proposed to promote gradient and residual homogeneity, speeding entry into total diffusion and enhancing test performance across PINN benchmarks. The study also links SNR transitions to information compression via activation saturation, revealing a layer-wise hierarchy in information flow that supports IB interpretations. The findings offer a principled IB-informed view of PINN optimization and practical strategies to improve generalization for PDE-constrained learning and neural operators.
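The residual-based attention idea mentioned above can be sketched as a per-sample weight that grows where the residual is large and decays elsewhere. The summary does not state the exact update rule, so the form below (exponential decay plus a max-normalized residual term) is an assumption for illustration, and the names `rba_update`, `weighted_quadratic_loss`, `gamma`, and `eta` are hypothetical.

```python
def rba_update(weights, residuals, gamma=0.999, eta=0.01):
    """One step of an assumed residual-based attention update.

    Assumed rule (not confirmed by the text):
        w_i <- gamma * w_i + eta * |r_i| / max_j |r_j|
    gamma is a decay factor and eta a weight learning rate; both are
    illustrative defaults.
    """
    max_abs = max(abs(r) for r in residuals)
    if max_abs == 0.0:
        # All residuals vanish: only the decay term acts.
        return [gamma * w for w in weights]
    return [gamma * w + eta * abs(r) / max_abs
            for w, r in zip(weights, residuals)]


def weighted_quadratic_loss(weights, residuals):
    """Mean of (w_i * r_i)^2, a re-weighted quadratic loss."""
    return sum((w * r) ** 2 for w, r in zip(weights, residuals)) / len(residuals)


# Three collocation points with uneven residuals.
residuals = [0.1, -2.0, 0.5]
weights = rba_update([0.0, 0.0, 0.0], residuals)
loss = weighted_quadratic_loss(weights, residuals)
```

Under this assumed rule the weights stay bounded by `eta / (1 - gamma)`, so points with persistently large residuals receive more attention without the weights growing indefinitely.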

Abstract

We investigate the learning dynamics of fully-connected neural networks through the lens of gradient signal-to-noise ratio (SNR), examining the behavior of first-order optimizers like Adam in non-convex objectives. By interpreting the drift/diffusion phases in the information bottleneck theory, focusing on gradient homogeneity, we identify a third phase termed "total diffusion", characterized by equilibrium in the learning rates and homogeneous gradients. This phase is marked by an abrupt SNR increase, uniform residuals across the sample space and the most rapid training convergence. We propose a residual-based re-weighting scheme to accelerate this diffusion in quadratic loss functions, enhancing generalization. We also explore the information compression phenomenon, pinpointing a significant saturation-induced compression of activations at the total diffusion phase, with deeper layers experiencing negligible information loss. Supported by experimental data on physics-informed neural networks (PINNs), which underscore the importance of gradient homogeneity due to their PDE-based sample inter-dependence, our findings suggest that recognizing phase transitions could refine ML optimization strategies for improved generalization.
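As a rough illustration of the quantity the abstract is built around, the sketch below computes a gradient SNR from a set of per-batch gradient vectors. It assumes the SNR is the L2 norm of the batch-wise mean gradient divided by the L2 norm of the batch-wise standard deviation; the abstract does not spell out the formula, so this definition and the function name `gradient_snr` are assumptions.

```python
import math


def gradient_snr(batch_grads):
    """Assumed SNR: ||mean over batches||_2 / ||std over batches||_2.

    batch_grads is a list of gradient vectors (equal-length lists of
    floats), one per batch.
    """
    n_batches = len(batch_grads)
    n_params = len(batch_grads[0])
    mean = [sum(g[j] for g in batch_grads) / n_batches
            for j in range(n_params)]
    var = [sum((g[j] - mean[j]) ** 2 for g in batch_grads) / n_batches
           for j in range(n_params)]
    signal = math.sqrt(sum(m * m for m in mean))
    noise = math.sqrt(sum(var))  # L2 norm of the per-parameter std
    if noise == 0.0:
        return math.inf if signal > 0.0 else 0.0
    return signal / noise


# Batches that agree on a direction: signal dominates.
aligned = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
# Batches that cancel each other: no net signal.
cancelling = [[1.0, -1.0], [-1.0, 1.0]]

snr_aligned = gradient_snr(aligned)
snr_cancelling = gradient_snr(cancelling)
```

With these toy inputs, the aligned batches give an SNR well above 1, while the cancelling batches give an SNR of 0, matching the qualitative regimes described in the figure captions.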

Paper Structure (22 sections, 50 equations, 14 figures)

Figures (14)

  • Figure 1: Phase transition in PINNs: The test error between the prediction and the exact solution converges faster after total diffusion (dashed lines), which occurs with an abrupt phase transition defined by homogeneous residuals. Although the convergence starts during the onset of the diffusion phase, the optimal training performance is met when the gradients of different batches become equivalent, indicating a general agreement on the direction of the optimizer steps (total diffusion).
  • Figure 2: Gradient-based optimization regimes: Indicative $\text{SNR}$ training curve at each full-batch iteration. For $\text{SNR} \gg 1$, the deterministic term dominates, while for $\text{SNR} \ll 1$, each step becomes more stochastic. The first two stages of learning are defined as "fitting" ($\text{SNR} \gg 1$) and "diffusion" ($\text{SNR} < \mathcal{O}(1)$). The "total diffusion" starts when the batch gradients are approximately equivalent, met with an abrupt increase of the SNR, which typically stabilizes above $\mathcal{O}(1)$. During the final stages, SNR decreases as the signal (numerator) tends to zero and some noise (denominator) persists.
  • Figure 3: Batch-wise SNR directions: Indicative directions in the parameter space for each SNR case. For $\text{SNR}>1$, there is an agreement of directions among samples $x_i$, resulting in a deterministic $\frac{\partial\mathcal{L}}{\partial \theta}$ of large magnitude. When $\text{SNR} = 0$, the directions cancel out, indicating convergence to a local minimum. Finally, for $\text{SNR} = 1$, there is an equilibrium between determinism and stochasticity in the direction of $\frac{\partial\mathcal{L}}{\partial \theta}$, for which the magnitude is non-zero.
  • Figure 4: Landscape scaling with Adam: Indicative loss landscape for two parameters (a), convergence of learning rate corrections (b), initial slices at the global minimum ($\theta_1=0, \theta_2=-10$) (c) and scaled slice for $\theta_1$ with the average learning rate correction (d). Adam assigns a larger learning rate for $\theta_1$ on average since it is more sensitive to the loss than $\theta_2$. This has an adaptive scaling effect as the steps on $\theta_1$ gradually decrease faster than for $\theta_2$.
  • Figure 5: Training trajectory: At each step $t$, Adam aims to maintain a consistent direction based on the gradient signal $\mathbb{E}[\nabla_{\theta}\mathcal{L}]$. However, this does not guarantee agreement among the sample-wise gradients, which can lead to convergence to a local minimum where the sample-wise gradients cancel out despite being disproportionate. Such a scenario could indicate that the model overfits certain $x_i$ while underfitting others.
  • ...and 9 more figures
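
The link between Adam and the gradient signal described in the captions above can be illustrated with the standard Adam update for a single scalar parameter. This is a minimal sketch of the textbook algorithm with its usual default hyperparameters, not the authors' experimental setup; the function name `adam_steps` and the toy gradient sequences are hypothetical.

```python
import math


def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the parameter update Adam applies at each step for one scalar."""
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        steps.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


# Gradients that always agree in sign (high SNR).
consistent = adam_steps([1.0] * 200)
# Gradients of the same magnitude that flip sign every step (low SNR).
noisy = adam_steps([1.0 if t % 2 == 0 else -1.0 for t in range(200)])
```

In this toy setting the step size stays close to `lr` when the gradients agree, and shrinks by more than an order of magnitude when they alternate in sign, since the first-moment estimate averages out while the second-moment estimate does not.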