Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions
Zishuo Bao, Yibo Liu, Changyutao Qiu
TL;DR
The paper addresses the challenge of hyperparameter tuning in fine-tuning large language models, where training loss and validation performance respond differently to optimization signals. It proposes Bilevel-BO-SWA, a framework that fuses models and employs a bilevel Bayesian optimization strategy, using EI in the outer loop and UCB in the inner loop (with various pairings) to navigate hyperparameter space. Empirical evaluation on RoBERTa-base with the GLUE benchmark shows that the EI–UCB configuration achieves the highest average score of 76.82, outperforming standard fine-tuning by 2.7%, and demonstrates improved loss minimization across tasks. The work also analyzes how acquisition-function interactions influence generalization and discusses limitations such as generalizability to other architectures and computational cost, outlining directions for adaptive, multi-fidelity, and broader-model studies.
Abstract
With the rise of diverse language model architectures, fine-tuning is becoming increasingly important for downstream tasks, yet finding proper hyperparameters for fine-tuning remains difficult. Although Bayesian optimization (BO) has been tried for hyperparameter tuning, most existing methods overlook the fact that BO relies on a careful choice of acquisition function, the component of BO that governs how much to explore versus exploit during optimization. Different acquisition functions have different levels of sensitivity to training loss and validation performance, yet existing methods often apply a single acquisition function regardless of whether training and validation performance are sensitive to it. This work introduces Bilevel-BO-SWA, a model fusion approach coupled with a bilevel BO strategy to improve the fine-tuning of large language models. We combine acquisition functions such as Expected Improvement (EI) and Upper Confidence Bound (UCB) in nested optimization loops, where the inner loop minimizes training loss while the outer loop optimizes the validation metric. Experiments on GLUE tasks using RoBERTa-base show that combining EI and UCB improves generalization, and that fine-tuning can be improved by up to 2.7%.
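To make the two acquisition functions named in the abstract concrete, the following is a minimal sketch of EI and UCB in their textbook forms, written for maximization. It is not the authors' implementation: the helper names, the `xi` and `kappa` parameters, and the posterior means and standard deviations are all illustrative assumptions, and no surrogate model is fitted here. The inner loop is modeled by applying UCB to the negated training loss, which is one common way to turn a minimization problem into a maximization one.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization: E[max(f - best - xi, 0)] with f ~ N(mu, sigma^2)."""
    gain = mu - best - xi
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    return gain * normal_cdf(z) + sigma * normal_pdf(z)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB for maximization: optimistic estimate mu + kappa * sigma."""
    return mu + kappa * sigma


def argmax_index(scores):
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical posterior (mean, std) per candidate hyperparameter setting.
# Inner loop: posterior over the NEGATED training loss, scored with UCB.
neg_loss_posterior = [(-0.40, 0.02), (-0.45, 0.10), (-0.38, 0.01)]
inner_scores = [upper_confidence_bound(m, s) for m, s in neg_loss_posterior]
inner_pick = argmax_index(inner_scores)

# Outer loop: posterior over the validation metric, scored with EI
# against the best validation score observed so far (assumed value).
val_posterior = [(0.70, 0.01), (0.68, 0.05), (0.72, 0.0)]
best_val_so_far = 0.71
outer_scores = [expected_improvement(m, s, best_val_so_far) for m, s in val_posterior]
outer_pick = argmax_index(outer_scores)
```

In this toy setting UCB favors the candidate with the largest uncertainty, while EI favors the candidate whose mean already exceeds the incumbent, which illustrates how the two functions can weight exploration and exploitation differently.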
