Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers

Rowan Bradbury, Aniket Srinivasan Ashok, Sai Ram Kasanagottu, Gunmay Jhingran, Shuai Meng

TL;DR

Deterministic Continuous Replacement (DCR) addresses the instability that arises when modules in pretrained transformers are replaced. It blends teacher and student outputs deterministically through a scheduled gate, which eliminates the gate-induced gradient variance inherent to stochastic gating. The authors formalize this and prove that it reduces gradient variance and curvature bias, and they introduce Deep Feature Guidance (DFG) as a near-zero-cost alignment term. In a controlled, single-seed experiment that replaces self-attention with reinitialized attention in a ViT-Small on CIFAR-100, DCR (with or without DFG) converges faster and aligns more closely with the teacher than stochastic Theseus and distillation baselines. The results support DCR as a stable, efficient path for heterogeneous operator swaps in frozen-backbone models, with possible extensions to compute-saturated regimes and larger architectures.

Abstract

Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
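
The contrast between deterministic blending and stochastic gating can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear schedule, and the convention that the weight on the teacher anneals from 1 to 0 are all assumptions made here for illustration. The abstract only states that the weight is deterministic and annealed. The sketch also shows variance in the forward output only, not the gradient-variance analysis the paper develops.

```python
import random


def dcr_alpha(step, total_steps):
    """Hypothetical linear anneal of the teacher weight from 1.0 down to 0.0.

    The schedule shape is an assumption; the source only says "annealed".
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - frac


def dcr_blend(teacher_out, student_out, alpha):
    """Deterministic blend: alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_out, student_out)]


def stochastic_gate(teacher_out, student_out, alpha, rng):
    """Stochastic replacement baseline (assumed form): use the teacher output
    with probability alpha, otherwise the student output."""
    return list(teacher_out) if rng.random() < alpha else list(student_out)


if __name__ == "__main__":
    teacher, student = [1.0, 2.0], [3.0, 4.0]
    alpha = dcr_alpha(step=50, total_steps=100)  # halfway through the anneal

    # The deterministic path gives the same output on every call.
    print("DCR blend:", dcr_blend(teacher, student, alpha))

    # The stochastic path matches the blend only in expectation; individual
    # draws jump between the teacher and the student output.
    rng = random.Random(0)
    draws = [stochastic_gate(teacher, student, alpha, rng)[0] for _ in range(1000)]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print("stochastic gate: mean of first coordinate =", round(mean, 3),
          "variance =", round(var, 3))
```

Under these assumptions, both paths share the same expected output at a given alpha, but only the stochastic gate adds sampling noise on top of it. That noise is the forward-pass counterpart of the gate-induced variance the abstract refers to.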

Paper Structure

This paper contains 34 sections, 21 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: Interface cosine similarity (cosine similarity of residual outputs) between teacher and student outputs at different layers (Block 0, 7, 11) across training epochs.
  • Figure 2: Validation accuracy during module replacement on CIFAR-100 (ViT-Small/16). Left: epochs. Right: wall-clock time.