Regime Change Hypothesis: Foundations for Decoupled Dynamics in Neural Network Training

Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

TL;DR

This work examines training dynamics in ReLU networks by focusing on activation patterns, which partition the input space into piecewise-affine regions. It proves a local stability result showing that activation patterns are preserved under small parameter perturbations for almost all inputs, and introduces a regime-change hypothesis under which activation-pattern updates stabilize earlier than weight updates. Empirically, activation-pattern convergence outpaces weight convergence across MLPs, CNNs, and Transformers, with an average speedup of approximately $3.85\times$ (except for MNIST in some cases). The combination of a measure-theoretic foundation and architecture-agnostic monitoring suggests new decoupled optimization strategies and improved interpretability for piecewise-linear networks.

Abstract

Despite the empirical success of deep neural networks (DNNs), their internal training dynamics remain difficult to characterize. In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. Motivated by this geometry, we investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model within largely stable activation regimes. We first prove a local stability property: outside measure-zero sets of parameters and inputs, sufficiently small parameter perturbations preserve the activation pattern of a fixed input, implying locally affine behavior within activation regions. We then empirically track per-iteration changes in weights and activation patterns across fully-connected and convolutional architectures, as well as Transformer-based models, where activation patterns are recorded in the ReLU feed-forward (MLP/FFN) submodules, using fixed validation subsets. Across the evaluated settings, activation-pattern changes decay 3 times earlier than weight-update magnitudes, showing that late-stage training often proceeds within relatively stable activation regimes. These findings provide a concrete, architecture-agnostic instrument for monitoring training dynamics and motivate further study of decoupled optimization strategies for piecewise-linear networks. For reproducibility, code and experiment configurations will be released upon acceptance.
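The tracking procedure described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two metrics below (fraction of flipped ReLU units on a fixed probe set, and Euclidean norm of the parameter difference) are hypothetical stand-ins, since the exact definitions used by the authors are not given in this summary. The "training step" is random noise with a decaying step size, not real optimization, and all function names are invented for the example.

```python
import math
import random

def activation_pattern(layers, x):
    """Return the on/off state of every ReLU unit for input x."""
    pattern, h = [], x
    for W, b in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]
        pattern.extend(z > 0 for z in pre)
        h = [max(z, 0.0) for z in pre]
    return tuple(pattern)

def activation_change_rate(old_layers, new_layers, inputs):
    """Fraction of (input, unit) pairs whose ReLU state differs (assumed metric)."""
    flips = total = 0
    for x in inputs:
        old = activation_pattern(old_layers, x)
        new = activation_pattern(new_layers, x)
        flips += sum(a != b for a, b in zip(old, new))
        total += len(old)
    return flips / total

def weight_update_norm(old_layers, new_layers):
    """Euclidean norm of the parameter difference between two iterates (assumed metric)."""
    sq = 0.0
    for (W0, b0), (W1, b1) in zip(old_layers, new_layers):
        for r0, r1 in zip(W0, W1):
            sq += sum((a - b) ** 2 for a, b in zip(r0, r1))
        sq += sum((a - b) ** 2 for a, b in zip(b0, b1))
    return math.sqrt(sq)

rng = random.Random(0)
sizes = [4, 16, 16]  # input width followed by two ReLU layers
layers = [([[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)],
           [rng.gauss(0, 1) for _ in range(n_out)])
          for n_in, n_out in zip(sizes, sizes[1:])]

# Fixed probe inputs, standing in for the fixed validation subset.
probe = [[rng.gauss(0, 1) for _ in range(sizes[0])] for _ in range(32)]

history = []
for step in range(5):
    lr = 0.1 / (step + 1)  # decaying step size, a stand-in for a real optimizer
    new_layers = [([[w + lr * rng.gauss(0, 1) for w in row] for row in W],
                   [b + lr * rng.gauss(0, 1) for b in bias])
                  for W, bias in layers]
    history.append((activation_change_rate(layers, new_layers, probe),
                    weight_update_norm(layers, new_layers)))
    layers = new_layers

for step, (act_change, w_norm) in enumerate(history):
    print(f"step {step}: activation change {act_change:.4f}, weight update norm {w_norm:.4f}")
```

Logging both quantities on the same probe set at every iteration is what makes the two timescales comparable: one curve counts discrete regime changes, the other measures continuous parameter movement.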

Paper Structure (5 sections, 2 theorems, 13 equations, 1 figure, 2 tables)

Key Result

Proposition 1

Let $f(x; w)$ be a ReLU-based MLP on $\mathbb{R}^{n_0}$ with parameters $w \in \mathbb{R}^p$. Then there exists a set $B \subset \mathbb{R}^p$ of measure zero with the property that every $w_0 \in \mathbb{R}^p \setminus B$ admits a measure-zero set $Z \subset \mathbb{R}^{n_0}$ for which the following ...
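The local stability property summarized in the abstract (sufficiently small parameter perturbations preserve the activation pattern of a fixed input) can be probed numerically. The sketch below is not the authors' proof or code. It draws a random small ReLU MLP and a random input, then shrinks a perturbation size `eps` on a coarse grid until a batch of sampled perturbations all leave the pattern unchanged. The network sizes, the uniform perturbation model, and every function name are assumptions made for illustration, and random sampling can only suggest stability, not establish it.

```python
import random

def activation_pattern(layers, x):
    """Return the on/off state of every ReLU unit for input x."""
    pattern, h = [], x
    for W, b in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]
        pattern.extend(z > 0 for z in pre)
        h = [max(z, 0.0) for z in pre]
    return tuple(pattern)

def random_layers(sizes, rng):
    """Gaussian weights and biases for consecutive layer widths in `sizes`."""
    return [([[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)],
             [rng.gauss(0, 1) for _ in range(n_out)])
            for n_in, n_out in zip(sizes, sizes[1:])]

def perturb(layers, eps, rng):
    """Add independent noise in [-eps, eps] to every parameter."""
    return [([[w + eps * rng.uniform(-1, 1) for w in row] for row in W],
             [b + eps * rng.uniform(-1, 1) for b in bias])
            for W, bias in layers]

def stable_radius(layers, x, rng, trials=50):
    """First eps on a decreasing grid for which all sampled perturbations keep the pattern."""
    base = activation_pattern(layers, x)
    eps = 1.0
    while eps > 1e-12:
        if all(activation_pattern(perturb(layers, eps, rng), x) == base
               for _ in range(trials)):
            return eps
        eps /= 10
    return None  # would indicate a pre-activation numerically at zero

rng = random.Random(0)
layers = random_layers([3, 8, 8], rng)
x = [rng.gauss(0, 1) for _ in range(3)]
eps = stable_radius(layers, x, rng)
print("pattern unchanged for all sampled perturbations of size", eps)
```

The search terminates because pre-activations depend continuously on the parameters: as long as no unit sits exactly at zero for the chosen input, a small enough perturbation cannot flip any sign. The measure-zero exceptional sets in the proposition correspond to the cases where some pre-activation is exactly zero.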

Figures (1)

  • Figure 1: Model curves showing the training and validation scores (left $y$-axis scale), as well as activation and weight convergence (right $y$-axis scale), for all model–dataset pairs. The train/validation score (TS/VS) corresponds to train/validation accuracy for all models except GPT-2, for which it reflects perplexity. Activation and weight convergence are measured as described before.

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Proposition 1: Local stability of activation patterns
  • proof
  • Remark 1
  • Corollary 1
  • proof
  • Remark 2