Vanishing Feature: Diagnosing Model Merging and Beyond

Xingyu Qu, Samuel Horvath

TL;DR

The paper identifies the vanishing feature (VF) phenomenon as a core reason why merging independently initialized models often underperforms and exhibits variance collapse. By decomposing merged activations into input-induced features and input-independent offsets, it uses VF to explain why permutation-based merging and post-merging normalization help, and shows how residual connections and carefully targeted normalization mitigate the issue. It introduces Preserve-First Merging (PFM), which preserves early-layer features and lets merged models outperform the original models in advanced settings without post-training; experiments on CIFAR-10/100, ImageNet, and Transformers support its efficacy. The authors also extend the VF analysis to one-shot pruning, showing that post-pruning normalization significantly improves pruning performance at high sparsity. Overall, the work provides a VF-centered framework for understanding and improving model merging and pruning.

Abstract

Model merging offers an efficient way to combine pre-trained neural networks but often suffers from inconsistent performance, especially when merging models with different initializations. We identify the "vanishing feature" phenomenon, where input-induced features diminish during propagation through the merged model, degrading performance. Through theoretical and empirical analysis, we reveal that this phenomenon underpins challenges like variance collapse and explains techniques like permutation-based merging, post-merging normalization, etc. We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue. Leveraging these insights, we propose the "Preserve-First Merging" (PFM) strategy, which focuses on preserving early-layer features, enabling the merged models, for the first time, to outperform the original models in advanced settings without post-training. Furthermore, we demonstrate that the vanishing feature phenomenon extends to other contexts, such as model pruning. Applying post-pruning normalization to mitigate the issue significantly improves one-shot pruning performance at high sparsity, offering a simple and effective post-pruning solution. The code is available at https://github.com/XingyuQu/VF.
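
The shrinking of input-induced features can be illustrated with a toy experiment. The sketch below is not the authors' code: it assumes bias-carrying linear networks whose parameters are random Gaussians (standing in for trained edge models), merges them by plain layer-wise averaging without any permutation alignment, and tracks the input-induced part of each activation. All function names (`init_linear_net`, `merge`, `decompose`, `mean_abs`) are hypothetical.

```python
import numpy as np

def init_linear_net(rng, depth, width):
    # Assumption: random Gaussian parameters stand in for a trained edge model.
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]
    biases = [rng.normal(0.0, 0.1, width) for _ in range(depth)]
    return weights, biases

def merge(net0, net1, alpha):
    # Layer-wise interpolation (1 - alpha) * theta_0 + alpha * theta_1 (assumed convention).
    return tuple(
        [(1 - alpha) * p0 + alpha * p1 for p0, p1 in zip(params0, params1)]
        for params0, params1 in zip(net0, net1)
    )

def decompose(net, x):
    # x has shape (width, n_samples). Returns per-layer (f, g, h) where
    # f is the activation, g the input-induced feature, h the bias-only offset.
    weights, biases = net
    f, g, h = x, x, np.zeros(x.shape[0])
    layers = []
    for w, b in zip(weights, biases):
        f = w @ f + b[:, None]  # full activation
        g = w @ g               # depends on the input only through the weights
        h = w @ h + b           # produced by the biases alone
        layers.append((f, g, h))
    return layers

def mean_abs(a):
    # Average absolute entry, analogous to the 1,mean norm used in Figure 1.
    return np.abs(a).mean()

rng = np.random.default_rng(0)
depth, width = 8, 64
net0 = init_linear_net(rng, depth, width)
net1 = init_linear_net(rng, depth, width)
x = rng.normal(size=(width, 128))

merged_layers = decompose(merge(net0, net1, 0.5), x)
edge_layers = zip(decompose(net0, x), decompose(net1, x))

ratios = []
for (f, g, h), ((_, g0, _), (_, g1, _)) in zip(merged_layers, edge_layers):
    assert np.allclose(f, g + h[:, None])  # the decomposition is exact for linear nets
    ratios.append(mean_abs(g) / (0.5 * (mean_abs(g0) + mean_abs(g1))))

for layer, r in enumerate(ratios, start=1):
    print(f"layer {layer}: merged / edge feature magnitude = {r:.3f}")
```

In this toy setup the ratio falls with depth because averaging two independent weight matrices reduces their typical entry magnitude. Trained models are not independent Gaussian draws, so the numbers are only indicative of the mechanism, not of the paper's measurements.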

Paper Structure (33 sections, 3 theorems, 31 equations, 18 figures, 16 tables, 2 algorithms)

Key Result

Proposition A.1

Consider two linear networks $\mathbf{f}_0$ and $\mathbf{f}_1$ of depth $L$, defined as: where $i \in \{0, 1\}$, $l \in [L]$, $\mathbf{W}_{i}^{(l)}$ is the weight matrix, and $\mathbf{b}_{i}^{(l)}$ is the bias vector of the $l$-th layer. The merged network $\mathbf{f}_{\alpha}$ is defined as: with $\mathbf{f}_{\alpha}^{(0)}(\mathbf{x}) = \mathbf{x}$ and $\alpha \in [0,1]$. The intermedi
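
The displayed equations are missing from the statement above. A plausible reading, offered as an assumption rather than the paper's verbatim formulas, uses the standard layer-wise affine form for the edge models and the interpolation $(1-\alpha)\theta_0 + \alpha\theta_1$ for the merged parameters. Unrolling the recursion then separates the activation into an input-induced term and an input-independent offset, matching the symbols $\mathbf{g}^{(l)}_{\alpha}$ and $\mathbf{h}^{(l)}_{\alpha}$ used in Figure 1:

```latex
% Assumed edge-model definition, i \in \{0,1\}, l \in [L]:
\begin{align*}
\mathbf{f}_i^{(l)}(\mathbf{x}) &= \mathbf{W}_i^{(l)} \mathbf{f}_i^{(l-1)}(\mathbf{x}) + \mathbf{b}_i^{(l)},
  & \mathbf{f}_i^{(0)}(\mathbf{x}) &= \mathbf{x}.
\end{align*}

% Assumed merging rule (alpha = 0 recovers f_0, alpha = 1 recovers f_1):
\begin{align*}
\mathbf{W}_\alpha^{(l)} &= (1-\alpha)\,\mathbf{W}_0^{(l)} + \alpha\,\mathbf{W}_1^{(l)}, &
\mathbf{b}_\alpha^{(l)} &= (1-\alpha)\,\mathbf{b}_0^{(l)} + \alpha\,\mathbf{b}_1^{(l)}, \\
\mathbf{f}_\alpha^{(l)}(\mathbf{x}) &= \mathbf{W}_\alpha^{(l)} \mathbf{f}_\alpha^{(l-1)}(\mathbf{x}) + \mathbf{b}_\alpha^{(l)}.
\end{align*}

% Unrolling the recursion (an empty matrix product is the identity):
\begin{align*}
\mathbf{f}_\alpha^{(l)}(\mathbf{x})
 &= \underbrace{\mathbf{W}_\alpha^{(l)} \cdots \mathbf{W}_\alpha^{(1)}\,\mathbf{x}}_{\mathbf{g}_\alpha^{(l)}(\mathbf{x})}
  + \underbrace{\sum_{j=1}^{l} \mathbf{W}_\alpha^{(l)} \cdots \mathbf{W}_\alpha^{(j+1)}\,\mathbf{b}_\alpha^{(j)}}_{\mathbf{h}_\alpha^{(l)}}.
\end{align*}
```

Under these assumptions, only the first term carries information about $\mathbf{x}$, so any per-layer shrinkage of $\mathbf{W}_\alpha^{(l)}$ compounds multiplicatively in $\mathbf{g}_\alpha^{(l)}(\mathbf{x})$, while the offset keeps receiving fresh bias contributions at every layer.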

Figures (18)

  • Figure 1: Vanishing feature phenomenon in merged linear networks $\mathbf{f}_{\alpha=0.5}$. Edge models $\textbf{f}_0,~\textbf{f}_{1}$ are trained on MNIST with different initializations. The mean and standard deviation over the test set are reported in the left and right plots. Statistics denoted by an overline are averaged across edge models. The average absolute value of a matrix $(A_{i,j})_{i\in[m],j\in[n]}$ is given by $\|\mathbf{A}\|_{1,\text{mean}}=\sum_{i,j}|A_{i,j}|/(mn)$. Left: The input-induced feature extracted by the merged model ($\mathbf{g}^{(l)}_{0.5}(\mathbf{x})$) progressively diminishes towards zero, causing the input-independent offset ($\mathbf{h}^{(l)}_{0.5}(\mathbf{x})$) to dominate the activation in later layers and degrading model performance. Mid: Parameter magnitudes at each layer decrease after merging, which contributes to the vanishing of features. Bias magnitudes, in contrast, increase across layers, shaping the gap between $\|\mathbf{g}^{(l)}_{0.5}(\mathbf{x})\|$ and $\|\textbf{h}^{(l)}_{0.5}\|$ seen in the left plot. Right: Scaling up $\textbf{g}^{(l)}_{\alpha}(\mathbf{x})$ to match the average feature magnitude of the edge models alleviates the vanishing feature issue, suggesting that the reduced magnitude is one underlying factor.
  • Figure 2: (A & B): Visualization of the vanishing feature phenomenon in the merged midpoint model $\mathbf{f}_{0.5}$. Edge models are trained on CIFAR-10. (A) shows output logits of ResNet20s averaged over the test dataset, where the standard deviation is scaled by $0.1\times$ for better visualization. (B) presents the normalized upper bound sequence $(\varepsilon_{\mathcal{D},\mathbf{f}_{0.5}}^{(l)})_{l=1}^{L}$, where the normalization factor is $\overline{\mathbb{E}_{\mathbf{x}}[\mathbf{g}^{(l)}(\mathbf{x})]}$ (see the appendix on vanishing feature mitigation). (C): Comparison of BatchNorm running statistics before and after RESET in the merged VGG16-BN model. (D): Performance evaluation of merging on linear networks, with and without residual connections. Solid curves correspond to WM-based merging.
  • Figure 3: Enhancing merging performance by addressing the vanishing feature issue. We compare four normalization strategies: REPAIR, RESCALE, bias removal, and bias calibration ("Bias Cal."). The latter three are specifically designed to mitigate the vanishing feature phenomenon. Permutations derived via weight matching (Left) and ZipIt! (Right) are applied prior to merging.
  • Figure 4: Left: Evaluation of WM/AM-based merging with only the first $l$ layers permuted before merging. Edge models are VGG16 trained on CIFAR-10. Mid: Evaluating PFM on merging two ResNet50s trained on ImageNet. Right: Illustration of PFM. The input is duplicated and processed independently through preserved layers from each edge model, retaining the latent representations. Outputs of the $l$-th layers are merged using permutations into a unified representation, which is then passed through the fully merged subsequent layers. Normalization is applied to the merged layers.
  • Figure 5: Vanishing feature in pruned models. (A & B): One-shot global magnitude pruning is applied with sparsity levels of 80% for VGG16-BN and 85% for ResNet20, both trained on CIFAR-10. (A) shows average output logits of a pruned VGG16-BN model (standard deviation scaled by $0.1\times$ for clarity). (B) plots the upper bound sequence $(\varepsilon_{\mathcal{D},\mathbf{f}_{p}}^{(l)})_{l=1}^{L}$, normalized by $\mathbb{E}_{\mathbf{x}}[\mathbf{g}^{(l)}(\mathbf{x})]$, highlighting mitigation of the vanishing feature by RESET. (C & D): Similar to merging, applying REPAIR to the pruned model can significantly improve its performance, particularly at higher sparsity. Lightweight hyperparameters were used for the WoodFisher pruner to reduce computational costs.
  • ...and 13 more figures
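
The PFM data flow described in the Figure 4 caption can be sketched for a small ReLU MLP. This is a minimal illustration under stated assumptions, not the authors' implementation: the merged representation is taken to be a weighted average of the two preserved outputs after reordering model 1's units, the layers after the preserved block are assumed to be already aligned to model 0's unit order, and the normalization step mentioned in the caption is omitted. The names `forward`, `pfm_forward`, `random_net`, and `perm` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(layers, x):
    # Plain MLP forward pass: ReLU after every layer except the last.
    for i, (w, b) in enumerate(layers):
        x = w @ x + b
        if i < len(layers) - 1:
            x = relu(x)
    return x

def pfm_forward(net0, net1, k, perm, x, alpha=0.5):
    # Preserve the first k layers of each edge model and merge the rest.
    # perm reorders model 1's layer-k units into model 0's unit order.
    assert 0 < k < len(net0) == len(net1)
    h0, h1 = x, x  # the input is duplicated
    for (w0, b0), (w1, b1) in zip(net0[:k], net1[:k]):
        h0 = relu(w0 @ h0 + b0)  # preserved layers of model 0
        h1 = relu(w1 @ h1 + b1)  # preserved layers of model 1
    # Merge the two latent representations into one (assumed: weighted average).
    h = (1 - alpha) * h0 + alpha * h1[perm]
    # Remaining layers are parameter-averaged (assumed already aligned).
    merged_tail = [
        ((1 - alpha) * w0 + alpha * w1, (1 - alpha) * b0 + alpha * b1)
        for (w0, b0), (w1, b1) in zip(net0[k:], net1[k:])
    ]
    return forward(merged_tail, h)

def random_net(rng, sizes):
    # Assumption: random parameters stand in for trained edge models.
    return [
        (rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), 0.1 * rng.normal(size=n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 16, 4]
net0 = random_net(rng, sizes)
net1 = random_net(rng, sizes)
x = rng.normal(size=sizes[0])

identity = np.arange(sizes[2])  # placeholder; a real run would use a matching permutation
y = pfm_forward(net0, net1, k=2, perm=identity, x=x)
print(y.shape)
```

A useful sanity check for such a sketch: if model 1 is a copy of model 0 with its layer-`k` units shuffled, passing the inverse shuffle as `perm` should reproduce model 0's output exactly.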

Theorems & Definitions (8)

  • Definition 4.1: $\varepsilon_{\mathcal{D},\mathbf{f}}$-Variance Collapse
  • Definition 4.2: $\varepsilon_{\mathcal{D},\mathbf{f}}$-Vanishing Feature in Linear Networks
  • Definition 5.1: $\varepsilon_{\mathcal{D},\mathbf{f}}$-Vanishing Feature
  • Proposition A.1: Restatement of the merged linear model equation
  • proof
  • Proposition A.2: Restatement of the merged linear residual model equation
  • proof
  • Corollary A.3: Intermediate Activation without Residual Connection at the First Layer