Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Akira Sakai, Yuma Ichikawa

TL;DR

This work formalizes sign lock-in theory, a stopping-time analysis of sign flips under SGD noise, and introduces a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.

Abstract

Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.
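As a rough illustration of the two interventions named above, the sketch below shows one way a gap-based initialization and an outward-drift regularizer could look. It is not taken from the paper. The clamping rule in `gap_init`, the penalty form $-\lambda \log|w|$, and the names `gap_init`, `log_barrier_penalty`, `log_barrier_grad`, `a_init`, `lam`, and `eps` are all assumptions made for illustration.

```python
import math
import random


def gap_init(n, std=0.02, a_init=0.01, seed=0):
    """Draw Gaussian weights, then move any |w| < a_init out to the gap edge.

    Assumption: the gap is enforced by clamping magnitudes. The sign of each
    weight is kept from the random draw; only the magnitude changes.
    """
    rng = random.Random(seed)
    weights = []
    for _ in range(n):
        w = rng.gauss(0.0, std)
        sign = 1.0 if w >= 0 else -1.0
        weights.append(sign * max(abs(w), a_init))
    return weights


def log_barrier_penalty(weights, lam=0.01, eps=1e-8):
    """Hypothetical outward-drift term: lam * sum(-log(|w| + eps)).

    The penalty grows as a weight approaches zero, so minimizing it
    discourages boundary crossings.
    """
    return lam * sum(-math.log(abs(w) + eps) for w in weights)


def log_barrier_grad(w, lam=0.01, eps=1e-8):
    """Derivative of -lam * log(|w| + eps) with respect to w."""
    sign = 1.0 if w >= 0 else -1.0
    return -lam * sign / (abs(w) + eps)


if __name__ == "__main__":
    ws = gap_init(5)
    print(ws, log_barrier_penalty(ws))
```

A gradient-descent step `w - lr * log_barrier_grad(w)` moves `w` away from zero on either side, which is the sense in which this term acts as an outward drift. Its strength relative to the task loss is governed by `lam`.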

Paper Structure (155 sections, 30 theorems, 255 equations, 20 figures, 2 tables)

Key Result

Proposition 3.5

Under the standard bounded-update and descent/noise conditions, there exists an explicit upper bound $g_T^{\mathrm{SGD}}$ such that for all $k\ge 0$, Moreover, $g_T^{\mathrm{SGD}}$ decreases when (i) the boundary margin $\rho-\epsilon$ grows, (ii) step sizes decay so that $\sum_{t<T}\eta_t^2$ is small, and (iii) mini-batch noise is moderate. Consequently, Assumption ass:reentry_T holds with $g_T:

Figures (20)

  • Figure 1: One-bit wall. Shannon's rate-distortion lower bound for binary sign patterns under Hamming distortion evaluated using an entropy-rate proxy estimated from pretrained weights. Across all models, this proxy is close to one, and the bound is nearly indistinguishable from that of an i.i.d. Rademacher baseline, indicating that sign patterns contain little redundancy.
  • Figure 2: Empirical validation of the one-bit wall. (a) SVD compressibility (best rank-$r$ approximation error) of $S=\operatorname{sign}(W)$ vs. $A=|W|$ as a function of rank ratio $r/d$ ($d=\min(m,n)$). (b) Spectral fit of sign matrices to an i.i.d. Rademacher baseline using a two-sample KS test on normalized singular values. (c) Initialization-to-trained sign drift in a Transformer trained on next-token prediction: flip ratio vs. initialization across layers (input$\rightarrow$output) and pooled.
  • Figure 3: Sign lock-in validation. Left: Histogram of the effective outer-to-outer flip count $K^{\mathrm{eff}}_T(\rho)$ across scalar weights; see Appendix \ref{app:charlm_q1q2q3_details} for $T$, $\rho$, and $\epsilon$. Right: Tail probability ${\mathbb P}[K^{\mathrm{eff}}_T(\rho)\ge k]$ on a log scale for multiple learning rates, with dashed geometric fits of the form $\hat{h}\,\hat{g}^{\,k-1}$.
  • Figure 4: Billion-scale sweep of the lock-in parameters $(\hat{h},\hat{g})$.
  • Figure 5: Flip–quality trade-off with a compressible sign template ($r{=}2$). Validation perplexity (mean$\pm$std over three seeds) vs. the mean per-step sign flip rate (flip_mean). Each curve fixes the gap threshold $a_{\mathrm{init}}$ and sweeps the log-barrier weight $\lambda$ from left to right ($0.5, 0.3, 0.1, 0.05, 0.01, 0.001, 0.0001$). Stronger stabilization suppresses flips but can worsen perplexity, while intermediate $a_{\mathrm{init}}$ and $\lambda$ achieve large flip reduction with little loss in validation quality.
  • ...and 15 more figures
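
The one-bit wall in Figure 1 can be made concrete with the Shannon lower bound for a binary source under Hamming distortion, $R(D) \ge h - H_b(D)$, where $h$ is the entropy rate in bits per sign and $H_b$ is the binary entropy function. The sketch below only evaluates that bound for a given `entropy_rate`. How the paper estimates its entropy-rate proxy from pretrained weights is not shown here, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def sign_rate_lower_bound(entropy_rate, distortion):
    """Shannon lower bound on bits per sign at Hamming distortion D.

    entropy_rate: assumed-given entropy-rate estimate h, in bits per sign.
    distortion:   allowed fraction of wrong signs, clipped to [0, 0.5].
    """
    d = min(max(distortion, 0.0), 0.5)
    return max(entropy_rate - binary_entropy(d), 0.0)


if __name__ == "__main__":
    # For an i.i.d. Rademacher source h = 1, and the bound is tight.
    for d in (0.0, 0.05, 0.1, 0.25, 0.5):
        print(d, round(sign_rate_lower_bound(1.0, d), 4))
```

With $h$ close to one, lossless storage of the signs ($D=0$) costs close to one bit per weight, and going well below one bit requires tolerating a substantial fraction of wrong signs.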

Theorems & Definitions (68)

  • Definition 3.1: Regions
  • Definition 3.2: Stopping time
  • Proposition 3.5: Informal version of Proposition \ref{prop:appF_ass2_from_sgd}: re-entry bound in SGD
  • Theorem 3.6: Sign Lock-in Theorem
  • Remark 3.7
  • Proposition 3.8
  • Remark 4.1
  • Lemma 4.1: Outer-to-outer sign flip forces a boundary visit
  • proof
  • Lemma 4.2: Many effective flips imply many boundary hits
  • ...and 58 more
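
The quantities behind Figure 3 and Lemma 4.1 can be sketched in a few lines. The code below counts outer-to-outer sign flips on a scalar weight trajectory and fits a geometric tail of the form $\hat{h}\,\hat{g}^{\,k-1}$ by least squares in log space. It assumes that the outer region is $|w| \ge \rho$ and that an effective flip is a sign change between consecutive visits to that region. This is a reading of the summary above, not the paper's exact definition, and all function names are hypothetical.

```python
import math


def effective_flips(trajectory, rho):
    """Count sign changes between consecutive visits to the outer region |w| >= rho."""
    count = 0
    last_outer_sign = None
    for w in trajectory:
        if abs(w) < rho:
            continue  # inside the band around zero: sign is not yet settled
        sign = 1 if w > 0 else -1
        if last_outer_sign is not None and sign != last_outer_sign:
            count += 1
        last_outer_sign = sign
    return count


def tail_probabilities(counts, k_max):
    """Empirical P[K >= k] for k = 1..k_max over per-weight flip counts."""
    n = len(counts)
    return [sum(c >= k for c in counts) / n for k in range(1, k_max + 1)]


def fit_geometric_tail(tail):
    """Fit tail[k-1] ~ h * g**(k-1) by least squares on log probabilities."""
    pts = [(i, math.log(p)) for i, p in enumerate(tail) if p > 0]
    if len(pts) < 2:
        raise ValueError("need at least two nonzero tail values")
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return math.exp(my - slope * mx), math.exp(slope)  # (h_hat, g_hat)


if __name__ == "__main__":
    counts = [effective_flips(t, rho=0.1) for t in (
        [0.5, 0.05, -0.02, 0.03, 0.4],   # wiggles near zero, no effective flip
        [0.5, 0.05, -0.3, -0.05, 0.2],   # two outer-to-outer flips
    )]
    print(counts)
```

Small oscillations around zero that never reach the outer region on the other side are ignored by `effective_flips`, consistent with the statement of Lemma 4.1 that an outer-to-outer flip forces a boundary visit.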