S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Jack Young

Abstract

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (a GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (a Mamba-2 hybrid), S0 reaches 71.8 +/- 1.3% and LoRA 71.4 +/- 2.4% (3 seeds), statistically indistinguishable at this sample size, while S0 requires no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, exceeding both S0 and LoRA, but at per-step inference cost. Taken together, these results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
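To make the training surface concrete, the following is a minimal PyTorch sketch of S0 tuning on a toy gated linear-recurrence layer. It is an illustration under stated assumptions, not the released library's API: the layer, the `S0` parameter name, and `s0_parameters` are hypothetical, and real GatedDeltaNet/Mamba-2 kernels compute the recurrence in parallel rather than token by token.

```python
import torch
import torch.nn as nn

class ToyRecurrentLayer(nn.Module):
    """Toy gated linear-recurrence layer (stand-in for GatedDeltaNet / Mamba-2)."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.k = nn.Linear(d_model, d_state)
        self.v = nn.Linear(d_model, d_model)
        self.q = nn.Linear(d_model, d_state)
        self.gate = nn.Linear(d_model, d_state)
        # S0 tuning trains ONLY this tensor: one initial state matrix per layer.
        self.S0 = nn.Parameter(torch.zeros(d_state, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        B, T, _ = x.shape
        S = self.S0.expand(B, -1, -1)  # learned initial state replaces zeros at t=0
        outs = []
        for t in range(T):  # sequential for clarity; real kernels parallelize this
            xt = x[:, t]
            a = torch.sigmoid(self.gate(xt)).unsqueeze(-1)          # decay gate
            # After the first step S0 is absorbed into the running state,
            # so inference cost is identical to the untuned model.
            S = a * S + self.k(xt).unsqueeze(-1) * self.v(xt).unsqueeze(1)
            outs.append(torch.einsum("bs,bsd->bd", self.q(xt), S))  # state readout
        return torch.stack(outs, dim=1)

def s0_parameters(model: nn.Module) -> list[nn.Parameter]:
    """Freeze all weights, then re-enable gradients only for the S0 tensors."""
    for p in model.parameters():
        p.requires_grad_(False)
    s0 = [p for name, p in model.named_parameters() if name.endswith("S0")]
    for p in s0:
        p.requires_grad_(True)
    return s0
```

Training would then be ordinary gradient descent on the verified solutions, with only the initial states in the optimizer, e.g. `torch.optim.AdamW(s0_parameters(model), lr=1e-3)` (the hyperparameters here are illustrative, not the paper's).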

Figures (4)

  • Figure 1: Overview of S0 tuning. (a) Cross-architecture comparison on HumanEval: S0 (teal) outperforms LoRA on Qwen3.5-4B ($p < 0.001$; ***) and is statistically indistinguishable from LoRA on FalconH1-7B (hatched bars; 3 seeds). Offset is shown for Qwen only (not applicable to FalconH1's recurrence). Error bars show $\pm 1$ standard deviation across seeds. (b) First-character divergence: of 27 FAIL-to-PASS flips, 23 (85%) diverge from baseline at the very first generated character (teal dots at position 0).
  • Figure 2: Computation graph for S0 tuning. The learned initial state $S_0$ (teal) is injected into each recurrent layer before the first token. After $t{=}1$, it is absorbed into the running state and adds zero computational overhead. All model weights remain frozen. A minimal state-swap sketch follows this figure list.
  • Figure 3: Scaling and architecture-specific tuning. (a) Performance gains increase monotonically with model scale, from $+2.6$ pp at 0.8B to $+44.0$ pp at 9B. Error bars show $\pm 1$ standard deviation across seeds. The 9B model achieves 76.1% absolute accuracy from a 32.1% baseline, suggesting larger models have more latent capability to unlock via state initialization. (b) FalconH1 alpha sweep: the default $\alpha{=}0.07$ yields only $+8.3$ pp, but architecture-specific tuning to $\alpha{=}0.6$--$0.7$ reaches $71.8\%$, matching LoRA's $71.4\%$. Large $\alpha$ ($\ge 2.0$) collapses performance.
  • Figure 4: First-character divergence in FAIL-to-PASS flips (27 flips, single-seed Qwen3.5-4B). (a) 23 of 27 corrected solutions (85%) diverge from baseline at character position 0, the very first generated character. (b) Cumulative distribution of divergence positions: all 27 flips diverge within the first 10% of the completion (mean 0.76%, median 0.0%). $S_0$ shifts the output distribution at the first opportunity; autoregressive decoding amplifies this into a qualitatively different solution.
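Because the tuned state in Figure 2 is just one tensor per recurrent layer, switching tasks amounts to loading a small file over the frozen base model. The sketch below illustrates this under the same assumptions as the earlier one: `save_s0` and `load_s0` are hypothetical helpers, each layer's tuned state is assumed to be registered as a parameter named `S0`, and this is not the released library's interface.

```python
import torch
import torch.nn as nn

def save_s0(model: nn.Module, path: str) -> None:
    """Save only the tuned initial states (a small file; ~48 MB in the paper)."""
    states = {name: p.detach().cpu()
              for name, p in model.named_parameters() if name.endswith("S0")}
    torch.save(states, path)

def load_s0(model: nn.Module, path: str) -> None:
    """Swap tasks in place: overwrite the S0 tensors, touch no other weights."""
    states = torch.load(path)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in states:
                p.copy_(states[name])

# Task switching needs no weight merging or model reload:
# load_s0(model, "humaneval_s0.pt")  # steer toward code generation
# load_s0(model, "math_s0.pt")       # steer toward math reasoning
```

This is the practical contrast with LoRA in the abstract: adapters must be merged into (or composed with) the base weights per task, whereas here the base model never changes and each task is a drop-in state file.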