How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging

Hugo Monzón Maldonado, Thomas Möllenhoff, Nico Daheim, Iryna Gurevych, Mohammad Emtiyaz Khan

TL;DR

This paper tackles the challenge of selecting task weights in multitask finetuning by introducing fast previews based on Bayesian model merging. It recasts merging as weighted surrogate minimization and derives a principled framework using exponential-family posteriors, with variational and mixture-based extensions to improve preview quality. The authors implement practical algorithms (AdamW-SG, IVON-Hess, MultiIVON-Hess) and validate them across vision and language benchmarks, showing that more expressive posteriors yield previews that track full multitask finetuning more closely at a fraction of the compute. The results indicate that Bayesian merging gives useful guidance for weight selection without retraining a model for every candidate weighting.

Abstract

When finetuning on multiple tasks together, it is important to weight them carefully to get good performance, but searching for good weights can be difficult and costly. Here, we propose to aid the search with fast previews that quickly give a rough idea of different reweighting options. We use model merging to create previews by simply reusing and averaging parameters of models trained on each task separately (no retraining required). To improve the quality of previews, we propose a Bayesian approach to design new merging strategies by using more flexible posteriors. We validate our findings on vision and natural-language transformers. Our work shows the benefits of model merging via Bayes to improve multitask finetuning.
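
The preview idea in the abstract can be sketched in a few lines: take parameters already trained per task, average them under many candidate weightings, and score each merged model. This is only a minimal sketch under stated assumptions, not the authors' implementation. The names `simplex_grid`, `merge_by_averaging`, `preview`, and `toy_score` are hypothetical, parameters are flattened into NumPy vectors, and the toy score stands in for a real validation metric.

```python
import itertools

import numpy as np


def simplex_grid(num_tasks, steps):
    """Return all weight vectors with entries k/steps that sum to 1."""
    grid = []
    for combo in itertools.product(range(steps + 1), repeat=num_tasks):
        if sum(combo) == steps:
            grid.append(np.array(combo, dtype=float) / steps)
    return grid


def merge_by_averaging(task_params, alpha):
    """Simple merging: the weighted parameter average sum_t alpha_t * theta_t."""
    return sum(a * theta for a, theta in zip(alpha, task_params))


def preview(task_params, evaluate, steps=4):
    """Score the merged parameters for every weighting on the grid.

    No training happens here: each candidate costs one merge and one evaluation.
    """
    return [
        (alpha, evaluate(merge_by_averaging(task_params, alpha)))
        for alpha in simplex_grid(len(task_params), steps)
    ]


# Toy stand-ins (assumptions): three "task models" as 2-D parameter vectors
# and a made-up score that prefers parameters close to the origin.
task_params = [
    np.array([1.0, 0.0]),
    np.array([0.0, 1.0]),
    np.array([-1.0, -1.0]),
]


def toy_score(theta):
    return -float(np.sum(theta ** 2))


results = preview(task_params, toy_score, steps=4)
best_alpha, best_score = max(results, key=lambda item: item[1])
```

In practice `evaluate` would load the merged parameters into the network and compute accuracy on held-out data; the grid resolution `steps` trades preview detail against the number of evaluations.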

Paper Structure

This paper contains 29 sections, 27 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Our goal is to aid the search for good weights in weighted multitask finetuning. We show a performance contour for three tasks with weights $\alpha_1$, $\alpha_2$, and $\alpha_3$. The well-performing regions are in the middle and achieve around $90\%$ accuracy. We create a cheap preview of the contours by using model merging, where previously trained models are quickly weighted with many $\boldsymbol{\alpha}$ values. The preview captures the rough shape of the true contours, encouraging a focus on the good regions.
  • Figure 2: An illustration of our Bayesian approach to improve preview quality for a toy multitask-learning problem with three tasks. The losses $\ell_t$ are defined over a 2-D $\boldsymbol{\theta}$ space and are weighted by $\alpha_t$, varied on a fixed grid over $[0, 1]$; more details are in \ref{ssec:hypillu}. Each dot corresponds to a weighting option. Panel (a) shows that parameter averaging $\sum_t \alpha_t \boldsymbol{\theta}_t$ gives a poor preview (red region) of the true performance (gray contour). The quality improves in panels (b) and (c), where merging strategies based on full Gaussian posteriors and on more flexible mixture-of-Gaussian posteriors are used, respectively. The cost increases slightly because of the Hessians and the required ensembling.
  • Figure 3: Results on image classification using ResNet-20 on CIFAR-10 with three tasks constructed from different sets of classes. Preview quality improves with the expressiveness of the posterior approximation. Notably, more mixture components improve the preview. Hessian-Weighted previews are generated with IVON-Hess and Mixture-Weighted previews with MultiIVON-Hess. Histograms show that the distribution of weights that achieve a similar accuracy also improves with better posteriors.
  • Figure 4: Results using ViT-B/32 on GTSRB, RESISC45, SVHN (top) and EuroSAT, Cars, Sun397 (bottom). The exact solution shows a large triangular area of well-performing weightings, which is better captured by Hessian-Weighted merging. Simple Merging fails especially around the edges, whereas Hessian-Weighted merging (AdamW-SG) performs much better (right). Similarly, the histograms show that Hessian-Weighted merging uncovers more high-accuracy weightings than Simple Merging.
  • Figure 5: Merging of multitask finetuned RoBERTa models on pairs of sentiment analysis tasks. Model merging provides good previews of weightings for multitask finetuning, but some trends (e.g. $\alpha_1\in[0.0,0.5]$ for SST2&RT) are only picked up by better posteriors and Hessian-Weighted merging (AdamW-SG). The first-named task is weighted by $\alpha_1$ and the other by $1-\alpha_1$.
  • ...and 5 more figures
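
The contrast drawn in the Figure 2 caption between parameter averaging and Hessian-based merging can be illustrated on a toy problem. The sketch below is a simplified illustration under stated assumptions, not the paper's algorithm. It assumes each task loss is exactly quadratic, $\ell_t(\boldsymbol{\theta}) = \frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_t)^\top \mathbf{H}_t (\boldsymbol{\theta}-\boldsymbol{\theta}_t)$, and it omits any prior or pretrained-model term. Under that assumption the minimizer of $\sum_t \alpha_t \ell_t$ is $(\sum_t \alpha_t \mathbf{H}_t)^{-1} \sum_t \alpha_t \mathbf{H}_t \boldsymbol{\theta}_t$, which plain averaging does not recover in general. All function names and toy values are hypothetical.

```python
import numpy as np


def simple_merge(thetas, alpha):
    """Parameter averaging: sum_t alpha_t * theta_t."""
    return sum(a * th for a, th in zip(alpha, thetas))


def hessian_weighted_merge(thetas, hessians, alpha):
    """Solve (sum_t alpha_t H_t) theta = sum_t alpha_t H_t theta_t.

    Assumes the weighted sum of Hessians is invertible.
    """
    precision = sum(a * H for a, H in zip(alpha, hessians))
    rhs = sum(a * (H @ th) for a, H, th in zip(alpha, hessians, thetas))
    return np.linalg.solve(precision, rhs)


def weighted_loss(theta, thetas, hessians, alpha):
    """Weighted multitask loss for the assumed quadratic task losses."""
    total = 0.0
    for a, H, th in zip(alpha, hessians, thetas):
        diff = theta - th
        total += a * 0.5 * float(diff @ H @ diff)
    return total


# Toy problem (assumption): three tasks in a 2-D parameter space, each with
# its own minimizer and a different curvature.
thetas = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, -1.0])]
hessians = [np.diag([10.0, 1.0]), np.diag([1.0, 10.0]), np.diag([2.0, 2.0])]
alpha = np.array([0.5, 0.3, 0.2])

theta_simple = simple_merge(thetas, alpha)
theta_hessian = hessian_weighted_merge(thetas, hessians, alpha)

loss_simple = weighted_loss(theta_simple, thetas, hessians, alpha)
loss_hessian = weighted_loss(theta_hessian, thetas, hessians, alpha)
```

For real networks the losses are not quadratic and full Hessians are too large to store, so a practical variant would rely on an approximation such as a diagonal curvature estimate; that choice is outside what this sketch shows.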