PLeaS -- Merging Models with Permutations and Least Squares

Anshul Nasery; Jonathan Hayase; Pang Wei Koh; Sewoong Oh

TL;DR

This work addresses merging neural networks trained on different datasets and from different initializations. It introduces PLeaS, a two-stage framework that first aligns layer-wise features by exploiting permutation symmetry and then fits the merged weights with layer-wise least squares to emulate the ensemble. The method supports flexible per-layer widths, enabling memory–compute–accuracy tradeoffs, and extends to a variant (PLeaS-free) that needs no data from the fine-tuning domains and instead computes activations on public data such as ImageNet. Across ResNet and ViT experiments on DomainNet and fine-grained classification datasets, PLeaS improves on state-of-the-art merging methods by up to 15 percentage points at the same target compute and often approaches ensemble performance with fewer parameters. The approach extends model merging to models fine-tuned from different base models and to merged models of a chosen size, and an open-source implementation is available.

Abstract

The democratization of machine learning systems has made the process of fine-tuning accessible to practitioners, leading to a wide range of open-source models fine-tuned on specialized tasks and datasets. Recent work has proposed to merge such models to combine their functionalities. However, prior approaches are usually restricted to models that are fine-tuned from the same base model. Furthermore, the final merged model is typically required to be of the same size as the original models. In this work, we propose a new two-step algorithm for merging models, termed PLeaS, which relaxes these constraints. First, leveraging the Permutation symmetries inherent in the two models, PLeaS partially matches nodes in each layer by maximizing alignment. Next, PLeaS computes the weights of the merged model as a layer-wise Least Squares solution that minimizes the approximation error between the features of the merged model and the permuted features of the original models. PLeaS allows a practitioner to merge two models sharing the same architecture into a single performant model of a desired size, even when the two original models are fine-tuned from different base models. We also show how our method can be extended to the challenging scenario where no data is available from the fine-tuning domains. We apply our method to merge ResNet and ViT models trained with shared and different label spaces, and show improvements over state-of-the-art merging methods of up to 15 percentage points for the same target compute when merging models trained on DomainNet and fine-grained classification tasks. Our code is open-sourced at https://github.com/SewoongLab/PLeaS-Merging.
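The two steps described in the abstract can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the two "models" are small random two-layer networks, model B is a shuffled and slightly perturbed copy of model A, every hidden unit is merged (no partial matching), and the permutation is found by brute force, which is only feasible for tiny widths. For realistic widths a linear assignment solver such as `scipy.optimize.linear_sum_assignment` would be the natural replacement. All variable and function names here are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 200, 3, 4, 2

# Toy "model A": one hidden layer with tanh features.
W1_a = rng.normal(size=(d_in, d_hid))
W2_a = rng.normal(size=(d_hid, d_out))

# Toy "model B" (assumption): A's hidden units in shuffled order, plus small noise.
perm = rng.permutation(d_hid)
W1_b = W1_a[:, perm] + 0.05 * rng.normal(size=(d_in, d_hid))
W2_b = W2_a[perm, :] + 0.05 * rng.normal(size=(d_hid, d_out))

X = rng.normal(size=(n, d_in))
Z_a, Z_b = np.tanh(X @ W1_a), np.tanh(X @ W1_b)  # features at layer i
Y_a, Y_b = Z_a @ W2_a, Z_b @ W2_b                # features at layer i+1


def best_permutation(Za, Zb):
    """Brute-force the P that maximizes sum_j <Za[:, j], Zb[:, P[j]]>."""
    sim = Za.T @ Zb  # pairwise inner products between feature columns
    d = sim.shape[0]
    return max(
        itertools.permutations(range(d)),
        key=lambda p: sum(sim[j, p[j]] for j in range(d)),
    )


# Step 1: permute B's features to align with A's, then average matched features.
P = list(best_permutation(Z_a, Z_b))
Z_tilde = 0.5 * (Z_a + Z_b[:, P])
Y_tilde = 0.5 * (Y_a + Y_b)

# Step 2: least squares for the weights mapping merged layer-i features
# to merged layer-(i+1) features.
W_ls, *_ = np.linalg.lstsq(Z_tilde, Y_tilde, rcond=None)

# Baseline for comparison: average the permuted weights directly.
W_avg = 0.5 * (W2_a + W2_b[P, :])

err_ls = np.linalg.norm(Z_tilde @ W_ls - Y_tilde)
err_avg = np.linalg.norm(Z_tilde @ W_avg - Y_tilde)
print("least-squares residual:", err_ls)
print("weight-averaging residual:", err_avg)
```

Because the least-squares solution minimizes the residual over all weight matrices of that shape, its residual on the fitting data can never exceed that of permuted weight averaging. How much it helps on real networks is an empirical question that this toy does not answer.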

Paper Structure (32 sections, 5 equations, 8 figures, 7 tables)

Introduction
Related works
Preliminaries
Method: PLeaS
Extending Git Re-Basin to partial merging
Permuted least squares
Data requirements of PLeaS
Experiments
Experimental Setup
Datasets
Baselines
Merging for the same size
Exploring the model size-accuracy tradeoff
Does PLeaS need data from the training domains?
Merging models with the same initialization
...and 17 more sections

Figures (8)

Figure 1: PLeaS is a two-step algorithm for merging models. The first step (left) finds layer-wise Permutations that match features across models in order to compute combined features $\tilde{Z}_i$. Features that are similar are merged, while those that are dissimilar are kept separate. The number of features to be merged depends on the target compute budget and can differ from layer to layer. The second step of PLeaS (right) finds weights for the merged model that map the combined features of layer $i$ (i.e., $\tilde{Z}_i$) to those of layer $i+1$ (i.e., $\tilde{Z}_{i+1}$) by solving a Least Squares problem for each layer.
Figure 2: Partial merging with permutations: We show the construction of the $7 \times 6$ weight matrix $W_i^m$ from two weights of size $5 \times 4$ in the first step of PLeaS. The merged inputs are copied and unpermuted to approximate the original inputs. Then we apply both weight matrices separately. Finally, we pair up the merged outputs and average the pairs. Since all operations used are linear, we can fuse them to construct $W_i^m$ using a single linear layer.
Figure 3: Memory-Performance trade-off for merged models: We merge pairs of models fine-tuned on different datasets, and compute the average performance across all four datasets for two settings: datasets with a shared label space (top) and datasets with different label spaces (bottom). Plotting average accuracy against the final merged model size, we find that PLeaS dominates the state-of-the-art methods.
Figure 4: Investigating the data requirement of PLeaS: We run PLeaS and PLeaS-Weight using data from either the actual domains or ImageNet (indicated by the suffix "free") in both the shared label space and different label spaces settings for ResNet-50. We plot the average accuracy across all datasets against the relative size of the output model. We find minimal performance drops for PLeaS-free.
Figure 5: Comparing our strategy for layer-wise merging with a linear baseline: We merge models using PLeaS and permutations, comparing the strategy described in the appendix against a linear strategy where $\frac{k}{d}$ is held constant.
...and 3 more figures
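
The construction in the Figure 2 caption (copy and unpermute the merged inputs, apply both weight matrices separately, then average the paired outputs) can be written as a product of three linear maps. The sketch below is one reading of that caption rather than the authors' code. The helper names, the specific matched pairs, and the ordering of merged features (matched pairs first, then the unmatched features of model A, then those of model B) are all assumptions made for illustration.

```python
import numpy as np


def merge_matrix(d, pairs):
    """Map stacked features [z_A; z_B] (length 2d) to merged features.

    Each matched pair (a, b) is averaged into one feature; unmatched
    features of A, then of B, are passed through unchanged.
    """
    k = len(pairs)
    M = np.zeros((2 * d - k, 2 * d))
    for r, (a, b) in enumerate(pairs):
        M[r, a] = 0.5
        M[r, d + b] = 0.5
    matched_a = {a for a, _ in pairs}
    matched_b = {b for _, b in pairs}
    rest = [a for a in range(d) if a not in matched_a]
    rest += [d + b for b in range(d) if b not in matched_b]
    for r, col in enumerate(rest, start=k):
        M[r, col] = 1.0
    return M


def unmerge_matrix(d, pairs):
    """Copy each merged feature back to the original slot(s) it came from."""
    return (merge_matrix(d, pairs) > 0).astype(float).T


rng = np.random.default_rng(0)
W_a = rng.normal(size=(5, 4))  # model A layer: 4 inputs -> 5 outputs
W_b = rng.normal(size=(5, 4))  # model B layer: same shape

in_pairs = [(0, 2), (3, 1)]           # 2 merged inputs  -> 4 + 4 - 2 = 6
out_pairs = [(0, 1), (2, 4), (4, 0)]  # 3 merged outputs -> 5 + 5 - 3 = 7

# Apply both weight matrices separately via a block-diagonal matrix.
block = np.zeros((10, 8))
block[:5, :4] = W_a
block[5:, 4:] = W_b

M_in, U_in = merge_matrix(4, in_pairs), unmerge_matrix(4, in_pairs)
M_out = merge_matrix(5, out_pairs)

# All three operations are linear, so they fuse into a single 7 x 6 matrix.
W_m = M_out @ block @ U_in
print(W_m.shape)

# Sanity check: when matched input features agree exactly, the fused layer
# reproduces "apply both models, then merge the outputs".
z_a, z_b = rng.normal(size=4), rng.normal(size=4)
for a, b in in_pairs:
    z_b[b] = z_a[a]
z = np.concatenate([z_a, z_b])
fused_out = W_m @ (M_in @ z)
reference_out = M_out @ block @ z
```

With two of four input features merged and three of five output features merged, the fused matrix has the $7 \times 6$ shape stated in the caption. When the matched inputs differ, the unmerge step only approximates the original inputs, which is the approximation error the least-squares step of PLeaS is meant to reduce.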
