Vanishing Feature: Diagnosing Model Merging and Beyond
Xingyu Qu, Samuel Horvath
TL;DR
The paper identifies the vanishing feature (VF) phenomenon as a core reason why merging independently initialized models often underperforms and exhibits variance collapse. By decomposing merged activations into input-driven features and input-independent offsets, it uses VF to explain why permutation-based merging and post-merging normalization help, and shows how residual connections and carefully targeted normalization mitigate the issue. It introduces Preserve-First Merging (PFM), which preserves early-layer features and lets merged models outperform the original models in advanced settings without post-training, supported by experiments on CIFAR-10/100, ImageNet, and Transformers. The authors also extend the VF analysis to one-shot pruning, demonstrating that post-pruning normalization can significantly improve pruning performance at high sparsity. Overall, the work provides a VF-centered framework for understanding and improving model merging and pruning in practical settings.
Abstract
Model merging offers an efficient way to combine pre-trained neural networks but often suffers from inconsistent performance, especially when merging models with different initializations. We identify the "vanishing feature" phenomenon, where input-induced features diminish during propagation through the merged model, degrading performance. Through theoretical and empirical analysis, we reveal that this phenomenon underpins challenges such as variance collapse and explains why techniques such as permutation-based merging and post-merging normalization are effective. We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue. Leveraging these insights, we propose the "Preserve-First Merging" (PFM) strategy, which focuses on preserving early-layer features, enabling the merged models, for the first time, to outperform the original models in advanced settings without post-training. Furthermore, we demonstrate that the vanishing feature phenomenon extends to other contexts, such as model pruning. Applying post-pruning normalization to mitigate the issue significantly improves one-shot pruning performance at high sparsity, offering a simple and effective post-pruning solution. The code is available at https://github.com/XingyuQu/VF.
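
To make the notion of input-induced features concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes one simple way to split a network's output: the input-independent offset is taken to be the output on an all-zero input, and the input-induced feature is the remainder. The helper names (forward, decompose, average_weights, feature_to_offset_ratio, random_mlp) are hypothetical, and the randomly initialized toy MLPs stand in for trained models, so the numbers it produces are not expected to reproduce the paper's results.

```python
import numpy as np


def forward(layers, x):
    """Run a ReLU MLP. `layers` is a list of (W, b); no activation after the last layer."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h


def decompose(layers, x):
    """Split outputs into (feature, offset).

    Assumption of this sketch: the offset is the output on a zero input,
    so it is the same for every example; the feature is what remains.
    """
    out = forward(layers, x)
    offset = forward(layers, np.zeros((1, x.shape[1])))  # shape (1, d_out)
    feature = out - offset
    return feature, offset


def average_weights(layers_a, layers_b):
    """Naive merging: element-wise average of two models' parameters."""
    return [((Wa + Wb) / 2, (ba + bb) / 2)
            for (Wa, ba), (Wb, bb) in zip(layers_a, layers_b)]


def feature_to_offset_ratio(layers, x):
    """Mean feature norm divided by offset norm; small values mean the offset dominates."""
    feature, offset = decompose(layers, x)
    return np.linalg.norm(feature, axis=1).mean() / (np.linalg.norm(offset) + 1e-12)


def random_mlp(rng, sizes):
    """Hypothetical stand-in for a trained model: random weights and small random biases."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)),
             rng.normal(0.0, 0.1, size=n))
            for m, n in zip(sizes[:-1], sizes[1:])]


rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]
model_a = random_mlp(rng, sizes)
model_b = random_mlp(rng, sizes)
merged = average_weights(model_a, model_b)

x = rng.normal(size=(32, 8))
feature, offset = decompose(merged, x)
ratio_merged = feature_to_offset_ratio(merged, x)
ratio_a = feature_to_offset_ratio(model_a, x)
```

Because the offset is constant across inputs, all variation of the output over a batch comes from the feature term. Under this assumed decomposition, comparing ratio_merged with ratio_a on trained models is one way to check whether the input-induced part shrinks relative to the offset after merging.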
