Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems

Matthew Barker, Andrew Bell, Evan Thomas, James Carr, Thomas Andrews, Umang Bhatt

TL;DR

This work tackles the challenge of end-to-end multi-objective hyperparameter optimization for LLM and RAG pipelines, addressing cost, latency, safety, and alignment. It introduces a Bayesian optimization framework using the hypervolume indicator HV and the acquisition function qLogNEHVI to efficiently explore noisy, high-dimensional configuration spaces that include LLM and embedding choices. The authors validate their approach on two industry-relevant benchmarks, FinancialQA and MedicalQA, and demonstrate superior Pareto fronts compared to baselines, while releasing the new datasets and offering practitioner guidance on task- and objective-dependence. The study highlights practical considerations for deploying MO-RAG configurations and outlines directions for future improvements, including decoupled evaluations and enhanced safety metrics, to improve real-world applicability.

Abstract

While Retrieval Augmented Generation (RAG) has emerged as a popular technique for improving Large Language Model (LLM) systems, it introduces a large number of choices, parameters and hyperparameters that must be made or tuned. This includes the LLM, embedding, and ranker models themselves, as well as hyperparameters governing individual RAG components. Yet, collectively optimizing the entire configuration in a RAG or LLM system remains under-explored - especially in multi-objective settings - due to intractably large solution spaces, noisy objective evaluations, and the high cost of evaluations. In this work, we introduce the first approach for multi-objective parameter optimization of cost, latency, safety and alignment over entire LLM and RAG systems. We find that Bayesian optimization methods significantly outperform baseline approaches, obtaining a superior Pareto front on two new RAG benchmark tasks. We conclude our work with important considerations for practitioners who are designing multi-objective RAG systems, highlighting nuances such as how optimal configurations may not generalize across tasks and objectives.
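To make the notion of a Pareto front over the four objectives concrete, here is a minimal sketch in plain Python. It is not the authors' code: the `Evaluation` class, the helper names, and all numeric values are hypothetical and chosen only for illustration. The only thing taken from the text is the direction of each objective (lower cost and latency are better, higher safety and alignment are better).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Evaluation:
    """One evaluated configuration (hypothetical container, not from the paper)."""
    name: str
    cost: float       # lower is better
    latency: float    # lower is better
    safety: float     # higher is better
    alignment: float  # higher is better

    def as_maximization(self) -> Tuple[float, float, float, float]:
        # Negate the minimized objectives so that every entry is "higher is better".
        return (-self.cost, -self.latency, self.safety, self.alignment)


def dominates(a: Evaluation, b: Evaluation) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    va, vb = a.as_maximization(), b.as_maximization()
    return all(x >= y for x, y in zip(va, vb)) and any(x > y for x, y in zip(va, vb))


def pareto_front(evals: List[Evaluation]) -> List[Evaluation]:
    """Keep only the evaluations that no other evaluation dominates."""
    return [p for p in evals if not any(dominates(q, p) for q in evals)]


# Hypothetical objective values, for illustration only.
evaluations = [
    Evaluation("A", cost=1.0, latency=2.0, safety=0.90, alignment=0.80),
    Evaluation("B", cost=0.5, latency=1.0, safety=0.70, alignment=0.85),
    Evaluation("C", cost=1.5, latency=2.5, safety=0.85, alignment=0.75),  # dominated by A
    Evaluation("D", cost=0.4, latency=3.0, safety=0.60, alignment=0.60),
]

front_names = [e.name for e in pareto_front(evaluations)]
print(front_names)
```

Configuration C is worse than A on all four objectives and drops out, while A, B, and D each win on at least one objective and so remain on the front. This is the sense in which no single configuration is "optimal" across all objectives.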

Paper Structure

This paper contains 30 sections, 7 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: A high-level overview of our approach. First, we source the datasets that we will use to optimize our RAG pipeline, define the choices, parameters and hyperparameters that will be optimized over (see the table of system parameters), and select the objectives for optimization (e.g., cost, latency, safety, and alignment). Second, we introduce a train-test paradigm for evaluating the performance of RAG pipelines, and use Bayesian optimization (BO) to find the optimal parameter configurations. We find that using BO with the qLogNEHVI (Daulton et al., 2021; Ament et al., 2023) acquisition function is well-suited for this problem, since it is adapted for noisy objective evaluations and makes use of a single composite objective called hypervolume improvement that allows for an arbitrary number of objectives. Third, we explore the Pareto frontier of parameter configurations, finding the best solutions over different objectives.
  • Figure 2: HV improvement on train and test splits for both datasets. Our proposed acquisition function for BO (qLogNEHVI) outperforms its noiseless variant (qLogEHVI) and both BO algorithms perform significantly better than the baselines. There is a noticeable increase in HV after iteration 20 (dotted line), indicating the end of Sobol sampling initializations for the BO algorithms, and the start of acquisition function-guided selections.
  • Figure 3: 2D projections of the 4D Pareto frontier for each algorithm for a fixed random seed on both datasets. We see our proposed algorithm (qLogNEHVI BayesOpt) obtains a superior Pareto front, with solutions concentrated towards high safety, high alignment, low cost, and low latency.
  • Figure 4: Radar charts comparing the four objective function evaluations for iterations chosen to optimize each objective. We see that improved safety can be achieved at the expense of increased cost and latency. N.B. Lower is better for cost and latency but higher is better for safety and alignment.
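The hypervolume (HV) reported in the figures can be illustrated with a small two-objective sketch in plain Python. It computes the exact area dominated by a set of points relative to a reference point, and the improvement from adding one candidate. This is only an illustration under simplifying assumptions: both objectives are maximized, the function names, points, and reference point are hypothetical, and the code is not qLogNEHVI itself, which estimates an expected improvement under a noisy surrogate model rather than evaluating known points.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized)."""
    # Points that do not beat the reference point in both objectives contribute nothing.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective downwards, adding one horizontal strip
    # each time the second objective reaches a new best value.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hv_improvement(front: Sequence[Point], candidate: Point, ref: Point) -> float:
    """Increase in hypervolume obtained by adding `candidate` to `front`."""
    return hypervolume_2d(list(front) + [candidate], ref) - hypervolume_2d(front, ref)


# Hypothetical values, for illustration only.
front: List[Point] = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
ref: Point = (0.0, 0.0)

base_hv = hypervolume_2d(front, ref)
gain_good = hv_improvement(front, (2.5, 2.5), ref)  # extends the front
gain_dominated = hv_improvement(front, (1.0, 1.0), ref)  # already dominated
print(base_hv, gain_good, gain_dominated)
```

A candidate that is already dominated yields zero improvement, while one that pushes the front outward yields a positive gain. Collapsing several objectives into this single scalar is what lets a hypervolume-based acquisition function handle an arbitrary number of objectives; minimized objectives such as cost and latency would be negated before use.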