Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking

Yuatyong Chaichana, Thanapat Trachu, Peerat Limkonchotiwat, Konpat Preechakul, Tirasan Khandhawit, Ekapol Chuangsuwanich

TL;DR

DRM addresses the challenge that finetuned models diverge in weight parameterization, hindering straightforward entrywise merging. It introduces a four‑step pipeline built around $\mathrm{SVD}$ on concatenated weight deltas to obtain a shared basis, followed by per‑task renormalization, pruning, and merging via sign election and disjoint averaging; the merged delta is then mapped back to the original parameter space. Across vision and language models, including ViT, DeBERTa, T5, and Llama3.1‑8B, DRM achieves state‑of‑the‑art results in both full finetuning and LoRA setups, with renormalization identified as the key factor enabling stable joint representations. The work demonstrates a practical, data‑efficient approach to constructing multitask models by fusing existing finetuned checkpoints without retraining from scratch. Overall, DRM provides a principled, scalable solution to robust knowledge fusion in neural networks with broad applicability to cross‑domain merging.
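The four steps can be sketched in NumPy for a single weight matrix. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `prune_top_k` and `drm_h_sketch` are hypothetical, only the horizontal-concatenation variant is shown, and folding the rescaled per-task singular values back into the pruned rows before sign election is an assumption that this summary does not specify.

```python
import numpy as np


def prune_top_k(matrix, keep_frac):
    """Keep the largest-magnitude `keep_frac` fraction of entries (ties kept), zero the rest."""
    k = max(1, int(round(keep_frac * matrix.size)))
    threshold = np.sort(np.abs(matrix), axis=None)[-k]
    return np.where(np.abs(matrix) >= threshold, matrix, 0.0)


def drm_h_sketch(deltas, keep_frac=0.5, eps=1e-12):
    """Merge same-shaped weight deltas (finetuned minus base) into one delta."""
    n = deltas[0].shape[1]

    # Decompose: SVD of the horizontally concatenated deltas gives a shared
    # basis U and one block of V^T per task.
    u, s, vt = np.linalg.svd(np.hstack(deltas), full_matrices=False)

    scaled_blocks = []
    for t in range(len(deltas)):
        vt_t = vt[:, t * n:(t + 1) * n]

        # Renormalize: unit-length rows, with the row norm moved into a
        # per-task singular value so the magnitude is preserved.
        row_norms = np.linalg.norm(vt_t, axis=1, keepdims=True)
        vt_tilde = vt_t / np.maximum(row_norms, eps)
        s_t = s[:, None] * row_norms

        # Prune: keep only the top entries by magnitude in the whole block.
        pruned = prune_top_k(vt_tilde, keep_frac)

        # Assumption: reapply the per-task singular values before merging.
        scaled_blocks.append(s_t * pruned)

    stack = np.stack(scaled_blocks)  # shape (tasks, rank, n)

    # Merge: elect a sign per position, then average only the nonzero
    # entries that agree with the elected sign (disjoint averaging).
    elected = np.sign(stack.sum(axis=0))
    agree = (np.sign(stack) == elected) & (stack != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged = np.where(agree, stack, 0.0).sum(axis=0) / counts

    # Map the merged result back to the original parameter space.
    return u @ merged
```

As a sanity check on the sketch, merging a single delta with `keep_frac=1.0` should reproduce that delta, and merging two identical deltas should return the same matrix.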

Abstract

In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings, ranging from smaller encoder-based models such as ViT and DeBERTa, to encoder-decoder models such as T5, to larger decoder-based models such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.

Paper Structure

This paper contains 66 sections, 6 theorems, 33 equations, 7 figures, 15 tables.

Key Result

Proposition 1

Let $V^T$ from a horizontally concatenated SVD be partitioned into task-specific blocks, e.g. $V^T = [V_A^T \;\; V_B^T]$ for two tasks. For any given row index $i$, the squared norms of the corresponding row vectors $v_{A,i}^T$ and $v_{B,i}^T$ must satisfy $\|v_{A,i}^T\|_2^2 + \|v_{B,i}^T\|_2^2 = 1$.
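The proposition can be checked numerically. The snippet below is a small sketch on random matrices, with the shapes and variable names chosen only for illustration. It relies on the fact that the rows of $V^T$ returned by a reduced SVD are orthonormal, so each full row has unit norm and the two task blocks must split that unit budget.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4  # illustrative shapes, not taken from the paper

# Two synthetic weight deltas standing in for tasks A and B.
delta_a = rng.standard_normal((m, n))
delta_b = rng.standard_normal((m, n))

# Horizontally concatenated SVD: [dW_A  dW_B] = U diag(s) V^T.
u, s, vt = np.linalg.svd(np.hstack([delta_a, delta_b]), full_matrices=False)

# Partition V^T into task-specific column blocks.
vt_a, vt_b = vt[:, :n], vt[:, n:]

# Squared row norms of each block, and their sum per row index i.
norm_share_a = np.sum(vt_a**2, axis=1)
norm_share_b = np.sum(vt_b**2, axis=1)
budget = norm_share_a + norm_share_b

print(np.allclose(budget, 1.0))
```

Neither block alone has unit-norm rows in general, which is the non-orthonormality that the renormalization step compensates for.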

Figures (7)

  • Figure 1: Decom-Renorm-Merge (DRM) is a model-merging method for building multitask models. Different models may not share the same weight parameterization. Thus, merging should occur in a shared decomposed weight space, not the original parameter space. DRM merges models' weight deltas $\Delta W^{(t)}$ (the difference of each finetuned model from a shared base model) into a single merged delta. DRM consists of four main steps: (a) Decompose: Concatenate the $\Delta W^{(t)}$ matrices horizontally (or vertically), then apply SVD to decompose them into a shared basis $U$ and individual weights $V_t$. Although the combined $V$ is orthonormal, each individual split $V_t$ is not. (b) Renormalize each row vector $v_{t,i}$ of $V_t$ to unit length, and scale the corresponding singular value to preserve the magnitude. This can be viewed as compensating for $V_t$’s non-orthonormality as illustrated in (c). (d) Prune each renormalized individual singular vector matrix $\tilde{V}_{t}$ by keeping only the top-$k$% of entries by magnitude. (e) Merge the pruned singular vector matrices across models using sign election and disjoint averaging.
  • Figure 2: Results obtained when merging different numbers of tasks. DRM-H maintains better performance as the number of merged tasks increases.
  • Figure 3: Percentages of entries being dropped from each row basis vector $v_{t,i}$ during the pruning of entire matrices: (Top) the middle LoRA layer (16th) of Llama-3.1 8B, with and without renormalization. (Bottom) weight deltas of the middle layer (6th) of ViT-B/32, with and without renormalization. Here, we prune $50\%$ of each matrix. The dropped percentages fluctuate severely without renormalization, with some vectors almost entirely zeroed out.
  • Figure 4: A matrix viewpoint of DRM-H. (First row) Horizontal joint decomposition. (Second row) Renormalization on row basis vectors. (Third row) Pruning of the right singular vector matrices. (Fourth row) Sign election and disjoint averaging of right singular vector matrices.
  • Figure 5: Histogram of sign agreement after pruning the original and the decompose-renormalized spaces of ViT-B/32 and Llama-3.1 8B. For each position in the weight delta matrix, we tally the agreement in sign across different tasks. Then, the agreement across all positions is aggregated into a histogram. Sign agreement ranges between $0.5$ and $1.0$, where $0.5$ denotes a position that has an equal number of positive and negative values, while $1.0$ denotes a position where all tasks have the same sign. We visualize the sign agreement in the original parameter space (blue), the joint singular space (orange and red), and after projecting the joint singular space back to the original space (green and purple).
  • ...and 2 more figures
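The per-row statistic described in the Figure 3 caption can be reproduced on toy data. The sketch below is only an illustration: the helper `dropped_fraction_per_row` is a hypothetical name, and the synthetic deltas are given deliberately different scales (an assumption, not the paper's setup) so that one task dominates the leading singular directions.

```python
import numpy as np


def dropped_fraction_per_row(matrix, prune_frac=0.5):
    """Prune the smallest-magnitude entries of the whole matrix and
    return the fraction of entries dropped in each row."""
    keep = max(1, matrix.size - int(round(prune_frac * matrix.size)))
    threshold = np.sort(np.abs(matrix), axis=None)[-keep]
    dropped = np.abs(matrix) < threshold
    return dropped.mean(axis=1)


rng = np.random.default_rng(0)
m, n = 32, 8  # illustrative shapes

# Assumption: task A's delta is 100x larger than task B's.
delta_a = 100.0 * rng.standard_normal((m, n))
delta_b = rng.standard_normal((m, n))

# Joint horizontal decomposition, then take task A's block of V^T.
_, _, vt = np.linalg.svd(np.hstack([delta_a, delta_b]), full_matrices=False)
vt_a = vt[:, :n]

# Renormalize each row vector of the block to unit length.
row_norms = np.linalg.norm(vt_a, axis=1, keepdims=True)
vt_a_renorm = vt_a / row_norms

raw = dropped_fraction_per_row(vt_a)
renorm = dropped_fraction_per_row(vt_a_renorm)

print("std of dropped fraction, without renormalization:", raw.std())
print("std of dropped fraction, with renormalization:   ", renorm.std())
```

In this toy setup, pruning the unnormalized block should remove the low-norm rows almost entirely, while the renormalized block spreads the dropped entries far more evenly across rows.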

Theorems & Definitions (10)

  • Proposition 1: Shared Norm Budget of Partitioned Basis Vectors
  • Proposition 2: Bounded Difference of Weight Delta Concatenation
  • Proposition 3: Shared Norm Budget of Partitioned Orthonormal Basis Vectors
  • proof
  • Corollary 3.1: Larger RMS Magnitude in Vector with Larger Norm Share
  • proof
  • Lemma 4: Singular Value Perturbation of a Concatenated Matrix (theorem from the cited work on perturbation analysis of singular values in concatenated matrices)
  • Lemma 5
  • proof
  • proof: Proof of Proposition 2 (Bounded Difference of Weight Delta Concatenation)