$C^2M^3$: Cycle-Consistent Multi-Model Merging

Donato Crisostomi, Marco Fumero, Daniele Baieri, Florian Bernard, Emanuele Rodolà

TL;DR

A novel data-free method for merging neural networks in weight space that optimizes the permutations of network neurons globally across all layers. When coupled with activation renormalization, this approach yields the best results in the task.

Abstract

In this paper, we present a novel data-free method for merging neural networks in weight space. Unlike most existing works, our method optimizes the permutations of network neurons globally across all layers. This lets us enforce cycle consistency of the permutations when merging $N \geq 3$ models, so that circular compositions of permutations can be computed without accumulating error along the path. We qualitatively and quantitatively motivate the need for such a constraint, showing its benefits when merging sets of models in scenarios spanning varying architectures and datasets. Finally, we show that, when coupled with activation renormalization, our approach yields the best results in the task.

Paper Structure

This paper contains 54 sections, 2 theorems, 19 equations, 17 figures, 10 tables, 4 algorithms.

Key Result

Theorem 3.1

Given a set of $n$ models $p_0,\dots,p_n$ and object-to-universe permutations $P_i^{p_j}$ computed via eq:gen-frank-wolfe-obj, the pairwise correspondences defined by $P_i^{p_l p_j}={P_i^{p_l}}\circ \left(P_i^{p_j}\right)^{T}$ are cycle-consistent, i.e., for all layer indices $i$, $2\leq j\leq n$.
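
The construction in the theorem can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes random permutation matrices stand in for the object-to-universe permutations of a single layer (the real ones come from the paper's optimization), and the names `random_permutation_matrix`, `pairwise_from_universe`, and `is_cycle_consistent` are hypothetical. It builds each pairwise map as $P^{l}(P^{j})^\top$ and verifies that every 3-cycle composes to the identity.

```python
import itertools

import numpy as np


def random_permutation_matrix(size, rng):
    """Return a size x size permutation matrix (stand-in for a learned one)."""
    return np.eye(size, dtype=int)[rng.permutation(size)]


def pairwise_from_universe(p_target, p_source):
    """Pairwise map source -> target, built as P^{target} (P^{source})^T."""
    return p_target @ p_source.T


def is_cycle_consistent(perms):
    """Check P^{AC} P^{CB} P^{BA} = I for every ordered triple of models."""
    identity = np.eye(perms[0].shape[0], dtype=int)
    for a, b, c in itertools.permutations(range(len(perms)), 3):
        p_ba = pairwise_from_universe(perms[b], perms[a])
        p_cb = pairwise_from_universe(perms[c], perms[b])
        p_ac = pairwise_from_universe(perms[a], perms[c])
        if not np.array_equal(p_ac @ p_cb @ p_ba, identity):
            return False
    return True


# Hypothetical sizes: 4 models, one layer with 6 neurons.
rng = np.random.default_rng(0)
num_models, width = 4, 6
to_universe = [random_permutation_matrix(width, rng) for _ in range(num_models)]

print(is_cycle_consistent(to_universe))  # True by construction
```

The check passes for any choice of permutations, because each $(P^{j})^\top P^{j}$ in the middle of the product cancels to the identity. Consistency follows from the factorization through the universe, not from the particular values.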

Figures (17)

  • Figure 1: Cycle-Consistent Multi-Model Merging over three models $A, B, C$. Left: existing methods seek pairwise permutations that map between models; note that $P^{AC} \circ P^{CB}\circ P^{BA} \neq I$ in general, unless this is explicitly enforced. Right: our method computes permutations $P^A$, $P^B$, $P^C$ from each model to a universe $U$, such that a pairwise permutation $P^{BA}$ mapping $A$ to $B$ can be obtained as $P^{BA} = P^{B} (P^{A})^\top$. This way, cycle-consistency is enforced by design and $P^{AC} \circ P^{CB}\circ P^{BA} = I$.
  • Figure 2: Existing methods accumulate error when cyclically mapping a model through a series of permutations, while $C^2M^3$ correctly maps the model back to the starting point.
  • Figure 3: 2D projection of the loss landscape when matching three modes $\Theta_A, \Theta_B, \Theta_C$; the models $\pi(\Theta_A), \pi(\Theta_B), \pi(\Theta_C)$ are their resulting images in the universe, and lie in the same basin. Red zones indicate low-loss regions (typically basins), while blue zones indicate high-loss ones.
  • Figure 4: Accuracy of the interpolated model using Git Re-Basin over different optimization seeds.
  • Figure 5: Cosine similarity of the weights of 5 ResNet20 models trained on CIFAR10 with $2\times$ width.
  • ...and 12 more figures
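
The round-trip behaviour described in the captions of Figures 1 and 2 can be illustrated on a toy weight matrix. This is a hedged sketch under simplifying assumptions: only the output neurons (rows) of one layer are permuted, the permutations are hand-picked cyclic shifts rather than the result of any matching algorithm, and `shift_matrix` and `round_trip` are hypothetical helper names.

```python
import numpy as np


def shift_matrix(size, k):
    """Permutation matrix that cyclically shifts rows by k positions."""
    return np.roll(np.eye(size, dtype=int), k, axis=0)


def round_trip(weights, p_ba, p_cb, p_ac):
    """Map layer weights A -> B -> C -> A by permuting output neurons (rows)."""
    return p_ac @ (p_cb @ (p_ba @ weights))


size = 4
weights_a = np.arange(size * 3, dtype=float).reshape(size, 3)

# Pairwise permutations chosen independently of each other (assumption:
# a stand-in for pairwise matchings that are not constrained to agree).
drifted = round_trip(
    weights_a,
    shift_matrix(size, 1),
    shift_matrix(size, 1),
    np.eye(size, dtype=int),
)

# Pairwise permutations factored through a universe: P^{BA} = P^B (P^A)^T, etc.
p_a, p_b, p_c = shift_matrix(size, 1), shift_matrix(size, 2), shift_matrix(size, 3)
restored = round_trip(weights_a, p_b @ p_a.T, p_c @ p_b.T, p_a @ p_c.T)

print(np.array_equal(drifted, weights_a))   # False: the cycle does not close
print(np.array_equal(restored, weights_a))  # True: the cycle closes exactly
```

In the first case the three maps compose to a shift by two rows, so the model does not return to its starting point. In the second the composition is the identity regardless of which universe permutations are used.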

Theorems & Definitions (4)

  • Definition 2.1
  • Theorem 3.1
  • Theorem A.1
  • proof