Model Fusion through Bayesian Optimization in Language Model Fine-Tuning

Chaeyun Jang, Hyungi Lee, Jungtaek Kim, Juho Lee

TL;DR

The paper tackles the resource-intensive and hyperparameter-sensitive problem of fine-tuning large language models by introducing BOMF, a Bayesian-optimization-guided model fusion method. BOMF uses a two-stage approach: it first searches hyperparameters on lightweight model variants to identify strong training trajectories, then uses multi-objective Bayesian optimization to determine fusion weights that balance multiple metrics with the training loss. A key insight is the misalignment between loss and metric landscapes in language tasks, which motivates moving beyond simple weight averaging to Pareto-aware fusion that jointly optimizes several objectives. Empirical results across medium- and large-scale LMs on GLUE, SQuAD, SAMSum, KorMedMCQA, and E2E show BOMF achieving superior or robust performance compared to Grid Fine-Tune, SWA variants, and single-metric baselines, with ablations highlighting the benefits of hyperparameter alignment and multi-objective optimization for efficiency and generalization.
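
To make the "Pareto-aware" idea above concrete, the following is a minimal sketch of how non-dominated candidates can be identified when two objectives (here, a loss and a metric error) are both minimized. The function name `pareto_front` and the numeric values are hypothetical and chosen only for illustration; this is not the paper's implementation, and it omits the Bayesian optimization loop that would propose the candidates.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points; lower is better on both axes."""
    front = []
    for i, (loss_i, err_i) in enumerate(points):
        # Point j dominates point i if it is no worse on both objectives
        # and strictly better on at least one of them.
        dominated = any(
            (loss_j <= loss_i and err_j <= err_i)
            and (loss_j < loss_i or err_j < err_i)
            for j, (loss_j, err_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (validation loss, 1 - F1) pairs for four candidate fusion weights.
candidates = [(0.30, 0.12), (0.25, 0.15), (0.35, 0.10), (0.40, 0.20)]
print(pareto_front(candidates))  # indices of candidates on the Pareto front
```

In this toy example the last candidate is worse than the first on both objectives, so it is excluded; the remaining three each trade loss against metric error and none can be discarded without a preference between the two.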

Abstract

Fine-tuning pre-trained models for downstream tasks is a widely adopted technique known for its adaptability and reliability across various domains. Despite its conceptual simplicity, fine-tuning entails several troublesome engineering choices, such as selecting hyperparameters and determining checkpoints from an optimization trajectory. To tackle the difficulty of choosing the best model, one effective solution is model fusion, which combines multiple models in a parameter space. However, we observe a large discrepancy between loss and metric landscapes during the fine-tuning of pre-trained language models. Building on this observation, we introduce a novel model fusion technique that optimizes both the desired metric and loss through multi-objective Bayesian optimization. In addition, to effectively select hyperparameters, we establish a two-stage procedure by integrating Bayesian optimization processes into our framework. Experiments across various downstream tasks show considerable performance improvements using our Bayesian optimization-guided method.
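
As a rough illustration of what "combining multiple models in a parameter space" can look like, the sketch below fuses several checkpoints by a weighted average of their parameters. The helper `fuse`, the toy parameter dictionaries, and the choice to normalize the weights so they sum to one are assumptions made for this example, not details taken from the paper. In the described method the fusion weights would be chosen by multi-objective Bayesian optimization; here they are simply given.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened values


def fuse(checkpoints: List[Params], weights: List[float]) -> Params:
    """Weighted average of checkpoints in parameter space."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Assumption: weights are normalized to sum to one before averaging.
    coeffs = [w / total for w in weights]
    fused: Params = {}
    for name, reference in checkpoints[0].items():
        fused[name] = [
            sum(c * ckpt[name][k] for c, ckpt in zip(coeffs, checkpoints))
            for k in range(len(reference))
        ]
    return fused


# Toy checkpoints standing in for three members of one training trajectory.
w1 = {"layer.weight": [0.0, 2.0]}
w2 = {"layer.weight": [2.0, 4.0]}
w3 = {"layer.weight": [4.0, 0.0]}

uniform = fuse([w1, w2, w3], [1.0, 1.0, 1.0])  # plain averaging of the members
skewed = fuse([w1, w2, w3], [0.0, 1.0, 3.0])   # non-uniform fusion weights
print(uniform, skewed)
```

Uniform weights reduce to simple weight averaging, while non-uniform weights give the extra degrees of freedom that a search procedure can tune against the loss and the target metric.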

Paper Structure

This paper contains 48 sections, 7 equations, 10 figures, 19 tables, 1 algorithm.

Figures (10)

  • Figure 1: Visualization of the loss landscape and the metric landscape over parameters for both the vision task and the language task (four panels: vision loss, language loss, vision metric, language metric). The metric is $1-$accuracy for the vision task and F1 score for the language task. In the vision task, we fine-tune the ResNet-50 model (He et al., 2016) pre-trained with ImageNet-21k (Russakovsky et al., 2015) on the Caltech-101 dataset (Li et al., 2022), while in the language task, fine-tuning is performed on a pre-trained model. The members shown in each figure are denoted as $w_1, w_2, w_3$.
  • Figure 2: Validation results on the MRPC dataset: loss (left panels) and F1 score (right panels) for varying learning rates, batch sizes, and numbers of frozen layers. Optimal hyperparameters align well across different numbers of frozen layers, except when all pre-trained layers are frozen.
  • Figure 3: Correlation between the performance of best-performing weights in a training trajectory and the performance of the fused model. We fine-tune the model on the RTE dataset. Each point is obtained from the evaluation of a single trajectory with varying hyperparameters.
  • Figure 4: Visualization of the loss landscape over parameters (a) and the metric landscape over parameters (b) for the vision task. The metric is the classification error ($1-$accuracy). We fine-tune the ImageNet-21k pre-trained ViT-B/16 model (Dosovitskiy et al., 2021) on the Caltech-101 dataset. The members shown in each figure are denoted as $w_1, w_2, w_3$. The trends are similar to those in the ResNet-50 case.
  • Figure 5: Validation loss and metric (F1 score) results on the MRPC dataset for varying hyperparameters ((a) batch size, (b) learning rate) and varying ranks. Panels (a) and (b) indicate that the optimal hyperparameters align consistently across different ranks.
  • ...and 5 more figures