Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts

Eren Unlu

Abstract

Width expansion offers a practical route to reuse smaller causal-language-model checkpoints, but selecting a widened warm start is not solved by zero-step preservation alone. We study dense width growth as a candidate-selection problem over full training states, including copied weights, optimizer moments, and scheduler state. In a small-scale TinyStories proxy, we compare exact-copy, perturbative, asymmetric-reset, and structured non-clone warm starts under matched continuation budgets. We evaluate zero-step preservation, short-lag probe metrics, and downstream continuation utility in deterministic and stochastic regimes. The picture is mixed and partially replicated through a reduced-pool seed-1 check. Exact-copy symmetric warm starts rank first in every completed 16-step probe and in the completed stochastic 128-step continuations at seed-0 steps 1000 and 2000 plus reduced seed-1 step 2000. By contrast, the structured non-clone challenger wins deterministic 128-step continuation. Early escape from the inherited cloned subspace is therefore not a universal selector: it helps in long deterministic continuation, but it misleads at short lag and under stochastic continuation. The result is narrow but useful: for dense width growth at this scale, preservation is not a universal ranking criterion, and the best replacement signal depends on both regime and lag budget.

Paper Structure

This paper contains 26 sections, 8 equations, 2 figures.

Figures (2)

  • Figure 1: Lag-budget and regime reconciliation for the main structured non-clone challenger. Each point shows the validation-loss AUC gap between refsubspace_statecopy and exactcopy_symmetric, computed as refsubspace minus exactcopy, so negative values favor refsubspace_statecopy and positive values favor exactcopy_symmetric. In the deterministic panel, every available series moves downward from the 16-step probe to the 128-step continuation, and the seed-0 step-1000 series plus both step-2000 series cross below zero. In the stochastic panel, all available series remain positive: the seed-0 step-1000 series stays close to zero at 128 steps, while the seed-0 and seed-1 step-2000 series move farther positive.
  • Figure 2: Selector top-1 regret across completed settings. Each cell shows the regret of a selector relative to the best candidate in the same report, with lower values better. The left panel covers the full seed-0 study; the right panel covers the reduced seed-1 replication. Probe KL is the strongest overall low-cost selector, staying at zero regret across all reduced seed-1 settings and all seed-0 deterministic long-horizon settings, but the seed-0 step-1000 stochastic 128-step run shows a small Probe RMS win over Probe KL. Probe escape reaches zero regret only in deterministic 128-step continuation, where the structured non-clone challenger actually wins. Probe RMS remains competitive in short and stochastic settings but misses both deterministic long-horizon reversals. Zero-step KL is often acceptable at short lag, but it fails when deterministic long-horizon continuation favors refsubspace_statecopy.
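
The two quantities plotted in these figures can be illustrated with a minimal sketch. The sign convention for the gap (refsubspace minus exactcopy) and the definition of regret (relative to the best candidate in the same report) follow the captions. Everything else is an assumption made for illustration: the trapezoidal AUC rule, the function names, the selector score directions, and all numeric values are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def loss_auc(steps: Sequence[float], losses: Sequence[float]) -> float:
    """Area under a validation-loss curve (trapezoidal rule; an assumed choice)."""
    if len(steps) != len(losses) or len(steps) < 2:
        raise ValueError("need at least two (step, loss) points of equal length")
    area = 0.0
    for i in range(1, len(steps)):
        area += 0.5 * (losses[i - 1] + losses[i]) * (steps[i] - steps[i - 1])
    return area


def auc_gap(challenger_auc: float, baseline_auc: float) -> float:
    """Gap as in the Figure 1 caption: challenger minus baseline.

    Negative values favor the challenger, positive values favor the baseline.
    """
    return challenger_auc - baseline_auc


def top1_regret(
    selector_scores: Dict[str, float],
    utility: Dict[str, float],
    higher_is_better: bool = False,
) -> float:
    """Regret of a selector's top-1 pick relative to the best candidate.

    `utility` is a lower-is-better quantity (here, validation-loss AUC).
    `higher_is_better` gives the direction of the selector score.
    """
    choose = max if higher_is_better else min
    picked = choose(selector_scores, key=selector_scores.get)
    return utility[picked] - min(utility.values())


# Hypothetical loss curves for a 128-step continuation (made-up numbers).
steps = [0, 64, 128]
aucs = {
    "exactcopy_symmetric": loss_auc(steps, [1.8, 1.5, 1.2]),
    "refsubspace_statecopy": loss_auc(steps, [2.0, 1.4, 1.0]),
}

# Negative here, so this toy setting favors refsubspace_statecopy.
gap = auc_gap(aucs["refsubspace_statecopy"], aucs["exactcopy_symmetric"])

# Hypothetical selector scores: zero-step KL (lower is better) prefers the
# exact copy, while an escape-style score (higher is better) prefers the
# non-clone challenger.
regret_zero_kl = top1_regret(
    {"exactcopy_symmetric": 0.0, "refsubspace_statecopy": 0.05}, aucs
)
regret_escape = top1_regret(
    {"exactcopy_symmetric": 0.0, "refsubspace_statecopy": 0.3},
    aucs,
    higher_is_better=True,
)
```

In this toy setting the preservation-based selector picks the exact copy and incurs positive regret, while the escape-based selector picks the challenger and incurs none. This mirrors only the qualitative pattern that the Figure 2 caption reports for deterministic 128-step continuation; it does not reproduce any reported value.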