Regime Change Hypothesis: Foundations for Decoupled Dynamics in Neural Network Training
Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
TL;DR
This work examines training dynamics in ReLU networks through their activation patterns, which partition the input space into regions on which the network is affine. It proves a local stability result: for almost all inputs, activation patterns are preserved under sufficiently small parameter perturbations. It also introduces a regime-change hypothesis, under which activation-pattern updates stabilize earlier than weight updates. Empirically, activation-pattern convergence outpaces weight convergence across MLPs, CNNs, and Transformers, with an average speedup of approximately $3.85\times$ (except for MNIST in some cases). The combination of a measure-theoretic foundation and architecture-agnostic monitoring suggests new decoupled optimization strategies and improved interpretability for piecewise-linear networks.
Abstract
Despite the empirical success of deep neural networks (DNNs), their internal training dynamics remain difficult to characterize. In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. Motivated by this geometry, we investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model within largely stable activation regimes. We first prove a local stability property: outside measure-zero sets of parameters and inputs, sufficiently small parameter perturbations preserve the activation pattern of a fixed input, implying locally affine behavior within activation regions. Using fixed validation subsets, we then empirically track per-iteration changes in weights and activation patterns across fully connected, convolutional, and Transformer-based models; in Transformers, activation patterns are recorded in the ReLU feed-forward (MLP/FFN) submodules. Across the evaluated settings, activation-pattern changes decay 3 times earlier than weight-update magnitudes, showing that late-stage training often proceeds within relatively stable activation regimes. These findings provide a concrete, architecture-agnostic instrument for monitoring training dynamics and motivate further study of decoupled optimization strategies for piecewise-linear networks. For reproducibility, code and experiment configurations will be released upon acceptance.
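The two quantities compared in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function names (`activation_pattern`, `pattern_change_rate`, `relative_weight_change`), the toy layer sizes, and the use of a relative L2 norm and a flip fraction as the two metrics are all assumptions made for illustration. The demo perturbs the weights of a small random ReLU MLP at several scales and reports how much the weights moved alongside the fraction of ReLU units that switched state on a fixed probe set.

```python
import math
import random


def activation_pattern(weights, biases, x):
    """Return the on/off state (1/0) of every ReLU unit for input x.

    weights: list of layers, each a list of rows (one row per output unit).
    biases:  list of layers, each a list with one bias per output unit.
    """
    pattern = []
    h = list(x)
    for W, b in zip(weights, biases):
        pre = [sum(w * v for w, v in zip(row, h)) + bias
               for row, bias in zip(W, b)]
        pattern.extend(1 if z > 0 else 0 for z in pre)
        h = [max(z, 0.0) for z in pre]  # ReLU
    return tuple(pattern)


def pattern_change_rate(before, after):
    """Fraction of (input, unit) pairs whose ReLU state differs."""
    flips = sum(a != b for p, q in zip(before, after) for a, b in zip(p, q))
    total = sum(len(p) for p in before)
    return flips / total


def _flatten(weights):
    return [w for W in weights for row in W for w in row]


def relative_weight_change(w_before, w_after):
    """L2 norm of the weight update divided by the L2 norm of the old weights."""
    old, new = _flatten(w_before), _flatten(w_after)
    diff = math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))
    return diff / math.sqrt(sum(o ** 2 for o in old))


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [4, 8, 8]  # hypothetical toy network: 4 inputs, two hidden layers
    weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes, sizes[1:])]
    biases = [[rng.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]
    probes = [[rng.gauss(0, 1) for _ in range(sizes[0])] for _ in range(16)]
    before = [activation_pattern(weights, biases, x) for x in probes]

    for eps in (1e-6, 1e-1, 1.0):  # perturbation scales, chosen arbitrarily
        perturbed = [[[w + eps * rng.gauss(0, 1) for w in row] for row in W]
                     for W in weights]
        after = [activation_pattern(perturbed, biases, x) for x in probes]
        print(eps,
              relative_weight_change(weights, perturbed),
              pattern_change_rate(before, after))
```

In a training loop, the same two functions would be evaluated between consecutive iterations on a fixed validation subset. The local stability property suggests that the flip fraction should vanish for sufficiently small updates even while the weight-change metric remains nonzero.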
