Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi, Albert Manuel Orozco Camacho, Eugene Belilovsky, Guy Wolf

TL;DR

The paper addresses the challenge of merging fully trained neural networks that originate from different initializations or data orders, where permutation-based alignment often fails due to distributed feature representations. It introduces CCA Merge, a method that uses Canonical Correlation Analysis to align linear combinations of neuronal activations across layers, enabling more accurate and scalable model fusion, including scenarios with many models. Empirical results show that CCA Merge consistently outperforms prior baselines across CIFAR, ImageNet, and disjoint data splits, while maintaining robustness as the number of models increases. The approach reduces the performance gap between merged models and ensembles at a lower computational and storage cost, with practical limitations noted around the need for input activations for alignment and associated compute overhead. Overall, CCA Merge provides a principled, flexible alternative to permutation-based merging, advancing the ability to extract and combine common representations learned by diverse networks.

Abstract

Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the different learned features of the models; however, it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters, reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex, and the minima found through learning are typically separated by high loss barriers. Numerous recent works have focused on finding permutations matching one network's features to the features of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume that a one-to-one mapping exists between the different models' neurons. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our alignment method leads to better performance than past methods when averaging models trained on the same or differing data splits. We also extend this analysis to the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge

Paper Structure (37 sections, 6 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Visual representation of using CCA Merge to align two models. Canonical Correlation Analysis is used to find a common representation space where orthogonal linear combinations of the features (neurons) from $\mathcal{A}$ and $\mathcal{B}$ are maximally correlated. The linear transformation $\mathbf{P}^\mathcal{A}$ (resp. $\mathbf{P}^\mathcal{B}$) and its inverse can be used to go from the representation space of model $\mathcal{A}$ (resp. model $\mathcal{B}$) to this common representation space and back. By applying $\mathbf{P}^\mathcal{B}$ first and then $\mathbf{P}^{\mathcal{A}^{-1}}$ we can align the representations of model $\mathcal{B}$ to those of model $\mathcal{A}$. Applying the same transformation directly to the parameters of model $\mathcal{B}$ effectively aligns the two models, thus allowing their merging.
  • Figure 2: Left column: distribution of correlation values between the neurons $\{\mathbf{z}_i^\mathcal{A}\}_{i=1}^n$ and $\{\mathbf{z}_i^\mathcal{B}\}_{i=1}^n$ of two ResNet20x8 models ($\mathcal{A}$ and $\mathcal{B}$) trained on CIFAR100 at two different merging layers; Right column: for $k\in\{1,2,3,4,5\}$ the distributions of the top $k$-th correlation values for all neurons in model $\mathcal{A}$ at those merging layers.
  • Figure 3: Distributions of top 1 (left column) and top 2 (right column) correlations (blue) and CCA Merge transformation coefficients (orange) across neurons from model $\mathcal{A}$ at two different merging layers. In the left column, for example, each neuron $\mathbf{z}_i^\mathcal{A}$ has one correlation value corresponding to $\max_{1\leq j\leq n} \mathbf{C}_{ij}$ and one coefficient value corresponding to $\max_{1\leq j\leq n} \mathbf{T}_{ij}$, where $\mathbf{C}$ is the cross-correlation matrix between neurons of models $\mathcal{A}$ and $\mathcal{B}$, and $\mathbf{T}$ is the CCA Merge transformation matching neurons of $\mathcal{B}$ to those of $\mathcal{A}$. Wasserstein distances between the distributions of top $k\in\{1,2\}$ correlations and top $k$ CCA Merge coefficients are reported, relative to the equivalent distances between correlations and Permute transforms (for which all top 1 values are 1 and all top 2 values are 0).
  • Figure 4: Accuracies of averaging multiple models after feature alignment with different merging methods. Mean and standard deviation across 4 random seeds are shown.
  • Figure 5: Percent (%) of non-optimal matches when merging ResNet20x8 models trained on CIFAR100. The mean and standard deviation across 15 possible 2-model merges out of a group of 6 models fully trained from different initializations are shown.
  • ...and 3 more figures
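
The alignment described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a single linear layer without bias, activations stored as (samples, neurons) matrices, and a small ridge term `eps` added for numerical stability. The names `inv_sqrt`, `cca_alignment`, `H`, `R`, and `M` are hypothetical and introduced only for illustration. The toy data makes the features of model B exact linear combinations of the features of model A, so no permutation could align them, while the CCA-based map recovers the mixing.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T  # V diag(w^-1/2) V^T


def cca_alignment(XA, XB, eps=1e-8):
    """Return (M, corrs) such that centered XB @ M approximates centered XA.

    XA, XB: (samples, neurons) activations of the same layer in models A and B.
    eps: assumed ridge term for numerical stability (not taken from the paper).
    """
    XA = XA - XA.mean(axis=0)
    XB = XB - XB.mean(axis=0)
    m, n = XA.shape
    S_AA = XA.T @ XA / (m - 1) + eps * np.eye(n)
    S_BB = XB.T @ XB / (m - 1) + eps * np.eye(n)
    S_AB = XA.T @ XB / (m - 1)
    A_is, B_is = inv_sqrt(S_AA), inv_sqrt(S_BB)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, corrs, Vt = np.linalg.svd(A_is @ S_AB @ B_is)
    P_A = A_is @ U     # model A space -> common space
    P_B = B_is @ Vt.T  # model B space -> common space
    # Apply P_B first, then the inverse of P_A: model B space -> model A space.
    M = P_B @ np.linalg.inv(P_A)
    return M, corrs


# Toy example: B's neurons are linear mixtures (via R) of A's neurons.
rng = np.random.default_rng(0)
H = rng.standard_normal((500, 10))                  # shared layer inputs
W_A = rng.standard_normal((6, 10))                  # layer weights of model A
R = np.eye(6) + 0.1 * rng.standard_normal((6, 6))   # hypothetical mixing matrix
W_B = R.T @ W_A                                     # layer weights of model B
XA, XB = H @ W_A.T, H @ W_B.T                       # activations; XB equals XA @ R

M, corrs = cca_alignment(XA, XB)
W_B_aligned = M.T @ W_B                  # same transform applied to B's parameters
W_merged = 0.5 * (W_A + W_B_aligned)     # average after alignment
```

In a multi-layer network the inverse of the same transform would also have to be applied to the input side of the following layer so that the function computed by model B is unchanged; that step is omitted here for brevity.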