LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging

Xinyu Wang, Ke Deng, Fei Dou, Jinbo Bi, Jin Lu

TL;DR

The paper tackles data-free model merging by identifying layer-wise heterogeneity in Vision Transformers and showing that a single global scale is insufficient. It introduces LARV, a data-free veneer that computes per-layer scales from two weight-only diagnostics, $e_\ell$ (information richness) and $c_\ell$ (conflict level), and applies either a continuous or tiered gate to modulate each layer's delta before merging with any base rule. Across FusionBench with ViT backbones and 8/14/20-task settings, LARV yields consistent improvements over diverse baselines, enhances robustness to input corruptions, and generalizes to unseen tasks and even NLP adapters (e.g., T5-LoRA) without additional tuning. The approach provides a principled, low-cost inductive bias that turns merging into a layer-aware process, broadening applicability to multi-task and cross-domain scenarios.

Abstract

Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation, and show it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
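The veneer idea described above (scale each task vector per layer, then hand the result to an unmodified base merger) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The three-tier depth split, the scale values 0.5/1.0/1.2, the helper names, and the use of plain task arithmetic as the base merger are all assumptions made for the example; the abstract states only that tiered scaling uses fixed values, suppresses shallow layers, and modestly amplifies deeper ones.

```python
from typing import Dict, List

# Hypothetical representation: layer index -> flat list of weights or deltas.
LayerWeights = Dict[int, List[float]]


def tiered_scales(num_layers: int, shallow: float = 0.5,
                  middle: float = 1.0, deep: float = 1.2) -> List[float]:
    """Assign one of three fixed scales by depth tercile.

    The tercile split and the default values are illustrative assumptions.
    Integer comparisons avoid floating-point boundary issues.
    """
    scales = []
    for layer in range(num_layers):
        if 3 * layer < num_layers:
            scales.append(shallow)      # first third: damp interference
        elif 3 * layer < 2 * num_layers:
            scales.append(middle)       # middle third: leave unchanged
        else:
            scales.append(deep)         # last third: modest amplification
    return scales


def rescale(task_vector: LayerWeights, scales: List[float]) -> LayerWeights:
    """Multiply every delta in layer l by that layer's scale."""
    return {layer: [scales[layer] * d for d in delta]
            for layer, delta in task_vector.items()}


def task_arithmetic(base: LayerWeights, task_vectors: List[LayerWeights],
                    alpha: float = 1.0) -> LayerWeights:
    """Stand-in base merger: add the summed task vectors to the base weights."""
    merged = {}
    for layer, weights in base.items():
        total = [0.0] * len(weights)
        for tv in task_vectors:
            for i, d in enumerate(tv[layer]):
                total[i] += d
        merged[layer] = [w + alpha * t for w, t in zip(weights, total)]
    return merged


def merge_with_veneer(base: LayerWeights,
                      task_vectors: List[LayerWeights]) -> LayerWeights:
    """Rescale each task vector layer-wise, then call the base merger as-is."""
    scales = tiered_scales(len(base))
    return task_arithmetic(base, [rescale(tv, scales) for tv in task_vectors])
```

Because the rescaling happens before the merger is called, any other rule that consumes task vectors could replace `task_arithmetic` without changes to the rest of the sketch, which mirrors the merger-agnostic claim in the abstract.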

Paper Structure

This paper contains 56 sections, 17 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of LARV merging. (Left) We compute two data-free, weight-derived diagnostics per layer—information richness $e_\ell$ and conflict level $c_\ell$—which exhibit clear depth-dependent trends across 8/14/20-task merges. These signals are converted into layer-wise scaling factors via continuous or tiered mappings. (Right) Each task vector is rescaled at its corresponding layer and merged with any base rule, forming a lightweight veneer that suppresses shallow-layer interference while enhancing deeper-layer alignment.
  • Figure 2: Layer-wise behavior of weight-only metrics. We show effective-rank contrast ($e_\ell$), commutator coefficient ($c_\ell$), and composite score ($s_\ell$) for five data-free merging methods on ViT-B/32 (top) and ViT-B/16 (bottom). Deeper layers exhibit higher $e_\ell$ and lower $c_\ell$, producing monotonically increasing composite scores and motivating our depth-aware scaling rule.
  • Figure 3: Performance on 7 corruption types and overall average. LARV consistently improves robustness across all corruption methods.
  • Figure 4: Sensitivity of the tiered scaling scheme across eight vision tasks. Each heatmap reports the $\Delta$ accuracy (LARV $-$ Base, in percentage points) for a particular choice of tiered scaling coefficients. Positive values (red) indicate improvements, negative values (blue) indicate decreases, and the colormap is centered at zero so that color intensity reflects the magnitude of deviation from the baseline. The results show that moderate adjustments to middle or deep layers consistently provide the largest gains, highlighting the robustness of the tiered design across architectures.
  • Figure 5: Corruption-wise accuracy gains introduced by LARV. Each heatmap visualizes the $\Delta$ accuracy (LARV $-$ Base, in percentage points) across seven corruption types (Motion, Impulse, Gaussian, Pixelate, Spatter, Contrast, JPEG) and their average (Avg.), evaluated on four datasets (Cars, EuroSAT, RESISC45, GTSRB) and their averaged performance. A positive delta indicates that LARV improves robustness, while a negative delta reflects a decrease in accuracy. The color map is centered at zero: red denotes negative changes, blue denotes positive improvements, and darker shades indicate larger magnitude.
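
The captions name a "commutator coefficient" $c_\ell$ as the conflict diagnostic, but this section does not give its formula. As a hedged illustration of what a weight-only conflict measure of this kind might look like, the sketch below computes a normalized commutator norm, $\lVert AB - BA \rVert_F / (\lVert A \rVert_F \lVert B \rVert_F)$, for two square delta matrices. The normalization, the restriction to square matrices, and the function names are assumptions for the example, not the paper's definition.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product for small illustrative matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def frobenius(a: Matrix) -> float:
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def commutator_coefficient(a: Matrix, b: Matrix, eps: float = 1e-12) -> float:
    """Normalized commutator norm of two square matrices (assumed definition).

    Returns 0 when a and b commute; larger values mean the two updates
    interact more strongly depending on the order in which they are applied.
    """
    ab, ba = matmul(a, b), matmul(b, a)
    comm = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(ab, ba)]
    return frobenius(comm) / (frobenius(a) * frobenius(b) + eps)
```

Under this assumed definition, two diagonal deltas give a coefficient of zero, while deltas that act on overlapping directions in incompatible ways give a positive value. That is at least consistent with reading lower $c_\ell$ in deeper layers as lower conflict.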