Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs

Tian Xia

Abstract

Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient, an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition 1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose FIM-Merging, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a +6.2 point gain on MATH500 over ACM-TIES (90.2 vs. 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of 47.3, surpassing the previous best ACM-TIES (43.3) by +3.9 points, while reducing average response length by 91.9% relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.

Paper Structure

This paper contains 26 sections, 1 theorem, 13 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Let $f: \mathbb{R}^d \rightarrow \mathbb{R}^m$ be the model output function, $\theta_0 \in \mathbb{R}^d$ the base parameters, $\delta = \theta_1 - \theta_0$ the task vector, and $\alpha \in [0,1]$ the merging coefficient. Define the Task Arithmetic merging error as: If $f$ is twice differentiable on $\{\theta_0 + t\delta : t \in [0,1]\}$, then: where $H_f$ denotes the Hessian of $f$ with respect
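For orientation, one standard bound of this kind is sketched below. It assumes the merging error is measured against linear interpolation of the two endpoint outputs, which may differ from the paper's exact definition; the symbols $\mathcal{E}(\alpha)$ and $g$ are introduced here for illustration only.

```latex
% Assumed definition (not taken from the paper): deviation of the merged
% model's output from the linear interpolation of the endpoint outputs.
\mathcal{E}(\alpha) := \bigl\| f(\theta_0 + \alpha\delta)
    - \bigl[(1-\alpha)\, f(\theta_0) + \alpha\, f(\theta_0 + \delta)\bigr] \bigr\|

% Restrict f to the merging path; the i-th component of g'' is a Hessian
% quadratic form in the task vector.
g(t) := f(\theta_0 + t\delta), \qquad
g_i''(t) = \delta^\top H_{f_i}(\theta_0 + t\delta)\, \delta

% Linear-interpolation error bound applied to g on [0,1].
\mathcal{E}(\alpha)
  \le \frac{\alpha(1-\alpha)}{2} \sup_{t\in[0,1]} \|g''(t)\|
  \le \frac{\alpha(1-\alpha)}{2} \sup_{t\in[0,1]} \|H_f(\theta_0 + t\delta)\| \, \|\delta\|^2
```

Under this assumed form the bound vanishes at $\alpha=0$ and $\alpha=1$ and peaks at $\alpha=1/2$, where it equals $\tfrac{1}{8}\sup_t \|H_f\|\,\|\delta\|^2$. Applied layer by layer, it grows with both the Hessian norm and $\|\delta^l\|^2$, the same product that the Figure 1 caption describes.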

Figures (4)

  • Figure 1: Overall framework of FIM-Merging. Given a base model $\theta_0$ and a fine-tuned model $\theta_1$, FIM-Merging computes diagonal FIM on $\theta_0$ using $N=8$ random token inputs (no calibration data required) and estimates per-layer task vector norms $\|\delta^l\|^2$. Their product $\hat{\mathcal{F}}^l \cdot \|\delta^l\|^2$ directly instantiates the Hessian bound in Proposition 1, and is used to assign layer-adaptive merging coefficients $\alpha^l$ via log-space normalization and sigmoid mapping. Early layers with high FIM scores receive conservative $\alpha^l$, while later layers receive aggressive $\alpha^l$. The resulting coefficients are applied within an enhanced TIES-Merging procedure with gate protection and residual norm calibration to produce the merged model $\theta_m$.
  • Figure 2: Per-layer merging coefficients $\alpha^l$ assigned by FIM-Merging at 1.5B and 7B scales. Early layers receive lower $\alpha$ (conservative merging) due to higher FIM$\times\|\delta\|^2$ scores, consistent with Proposition 1. The 7B model shows stronger layer differentiation, reflecting greater variation in per-layer Hessian norm across scales.
  • Figure 3: Accuracy vs. average response length trade-off on 1.5B L2S models. FIM-TIES (ours) achieves the highest accuracy (47.3%) with the shortest response length (411 tokens), dominating all baselines on both dimensions. Baseline lengths are computed from Table 2 of yao2025acm.
  • Figure 4: Nonlinearity analysis of Long-to-Short model merging (Qwen2.5 $\to$ Qwen2.5-Math at 1.5B and 7B scales). Left: Per-layer NL Score (at $\alpha=0.5$) decreases monotonically from early to late layers (1.5B mean = 0.240, 7B mean = 0.171). Middle: NL Score vs. relative delta scale (1.5B); submodule linearity holds only at small delta scales ($\lesssim 0.75\times$). Right: Strong positive correlation between NL Score and relative merging error (7B; Pearson $r=0.972$, $p<10^{-17}$), empirically supporting Proposition 1.
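The pipeline in the Figure 1 caption (diagonal FIM from random tokens, multiplied by $\|\delta^l\|^2$, then log-space normalization and a sigmoid) can be illustrated on a toy model. The following is a minimal NumPy sketch under several assumptions that are not from the paper: the model is a small tanh MLP rather than a transformer, labels are sampled from the model's own softmax output, normalization is a z-score in log space, and the names `diagonal_fisher_per_layer`, `layer_coefficients`, `alpha_min`, and `alpha_max` are hypothetical. The final merge is plain per-layer task arithmetic, not the enhanced TIES procedure.

```python
import numpy as np


def diagonal_fisher_per_layer(weights, embed, unembed, n_samples=8, seed=0):
    """Mean diagonal Fisher per layer of a toy tanh MLP language model.

    Inputs are uniformly random token ids; labels are sampled from the
    model's own softmax output, so no real text is involved (assumption).
    """
    rng = np.random.default_rng(seed)
    vocab = embed.shape[0]
    fisher = [np.zeros_like(w) for w in weights]
    for _ in range(n_samples):
        token = rng.integers(vocab)
        # Forward pass, keeping every activation for backprop.
        acts = [embed[token]]
        for w in weights:
            acts.append(np.tanh(w @ acts[-1]))
        logits = unembed @ acts[-1]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        label = rng.choice(vocab, p=probs)
        # Gradient of log p(label) with respect to the logits.
        grad_out = -probs
        grad_out[label] += 1.0
        grad_h = unembed.T @ grad_out
        for l in reversed(range(len(weights))):
            grad_pre = grad_h * (1.0 - acts[l + 1] ** 2)  # tanh derivative
            fisher[l] += np.outer(grad_pre, acts[l]) ** 2  # squared gradients
            grad_h = weights[l].T @ grad_pre
    return np.array([f.mean() / n_samples for f in fisher])


def layer_coefficients(fisher, delta_sq_norm, alpha_min=0.3, alpha_max=1.0, eps=1e-12):
    """Map FIM * ||delta||^2 scores to per-layer coefficients.

    Scores are z-normalized in log space, then passed through sigmoid(-z),
    so high-score layers receive a coefficient close to alpha_min.
    """
    score = np.log(fisher * delta_sq_norm + eps)
    z = (score - score.mean()) / (score.std() + eps)
    return alpha_min + (alpha_max - alpha_min) / (1.0 + np.exp(z))


# Toy base and fine-tuned models (random weights, purely illustrative).
rng = np.random.default_rng(1)
d, vocab, n_layers = 16, 50, 4
embed = rng.normal(size=(vocab, d))
unembed = rng.normal(size=(vocab, d)) / np.sqrt(d)
base = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
tuned = [w + 0.05 * rng.normal(size=w.shape) for w in base]

fisher = diagonal_fisher_per_layer(base, embed, unembed)
delta_sq = np.array([np.sum((t - b) ** 2) for t, b in zip(tuned, base)])
alphas = layer_coefficients(fisher, delta_sq)

# Plain per-layer task arithmetic with the adaptive coefficients.
merged = [b + a * (t - b) for a, b, t in zip(alphas, base, tuned)]
print(np.round(alphas, 3))
```

The sigmoid keeps every coefficient strictly inside the chosen range, and the log transform prevents a single layer with a very large score from flattening the differences among the rest.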

Theorems & Definitions (2)

  • Proposition 1: Merging Error Bound via Hessian
  • Proof: Proof Sketch