COM-BOM: Bayesian Exemplar Search for Efficiently Exploring the Accuracy-Calibration Pareto Frontier

Gaoxiang Luo, Aryan Deshwal

TL;DR

COM-BOM reframes exemplar selection in in-context learning as a multi-objective combinatorial optimization problem, jointly optimizing accuracy $f_{acc}(\boldsymbol{z})$ and calibration $f_{ECE}(\boldsymbol{z})$ (via $-f_{ECE}(\boldsymbol{z})$ for maximization). It introduces a sample-efficient Combinatorial Bayesian Optimization algorithm using Gaussian Process surrogates with an exponentiated Hamming kernel and a hypervolume-based acquisition (NEHVI) to approximate the Pareto front with few LLM evaluations. The method is validated on MMLU-Pro tasks using Qwen3-8B and LLaMA-3.3-70B, showing that COM-BOM discovers better accuracy–calibration trade-offs than baselines, with offline search reducing inference-time costs. This work advances reliable, calibration-aware ICL by delivering Pareto-optimal exemplar sets that support safer, more trustworthy deployment of LLMs in high-stakes settings.
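
To make the surrogate's similarity measure concrete, here is a minimal sketch of an exponentiated Hamming kernel over exemplar selections. It assumes that $\boldsymbol{z}$ is a binary inclusion vector over a candidate pool and that the kernel has the form $\exp(-d_H(\boldsymbol{z}, \boldsymbol{z}')/\ell)$ with a normalized Hamming distance $d_H$; the function name `hamming_kernel`, the `lengthscale` parameter, and this exact parameterization are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hamming_kernel(Z1, Z2, lengthscale=1.0):
    """Exponentiated Hamming kernel between rows of Z1 and rows of Z2.

    Assumed form (illustrative): k(z, z') = exp(-d_H(z, z') / lengthscale),
    where d_H is the fraction of positions at which z and z' differ.
    """
    Z1 = np.asarray(Z1)
    Z2 = np.asarray(Z2)
    # Pairwise normalized Hamming distance, shape (len(Z1), len(Z2)).
    dist = (Z1[:, None, :] != Z2[None, :, :]).mean(axis=-1)
    return np.exp(-dist / lengthscale)

# Three hypothetical selections over a pool of four candidate exemplars.
Z = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],  # differs from the first row in 2 of 4 positions
    [1, 0, 1, 0],  # identical to the first row
])
K = hamming_kernel(Z, Z)
```

Identical selections get similarity 1, and similarity decays as more exemplar choices differ, which is the property a Gaussian Process needs to share information between nearby exemplar sets.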

Abstract

Selecting an optimal set of exemplars is critical for good performance of in-context learning. However, prior exemplar search methods narrowly optimize for predictive accuracy and neglect model calibration, a key determinant of trustworthiness and safe deployment. In this paper, we formulate exemplar selection as a multi-objective optimization problem, explicitly targeting both the maximization of predictive accuracy and the minimization of expected calibration error. We solve this problem with a sample-efficient Combinatorial Bayesian Optimization algorithm (COM-BOM) to find the Pareto front that optimally trades off the two objectives of accuracy and calibration. We evaluate COM-BOM on multiple tasks from the unsaturated MMLU-Pro benchmark and find that COM-BOM beats or matches the baselines at jointly optimizing the two objectives, while requiring a minimal number of LLM API calls.
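
The two objectives and the notion of a Pareto front can be illustrated with a short sketch. It uses the standard equal-width binned estimator of expected calibration error and a brute-force non-dominance check over (accuracy, negated ECE) pairs. The bin count, the function names, and the example numbers are assumptions made for illustration; the abstract does not specify how the authors estimate ECE.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-weighted mean of |accuracy - confidence| (equal-width bins)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; indices run from 0 to n_bins - 1.
    idx = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def pareto_mask(objectives):
    """Boolean mask of non-dominated rows; every column is maximized."""
    Y = np.asarray(objectives, dtype=float)
    mask = np.ones(len(Y), dtype=bool)
    for i in range(len(Y)):
        dominates_i = np.all(Y >= Y[i], axis=1) & np.any(Y > Y[i], axis=1)
        mask[i] = not dominates_i.any()
    return mask

# Hypothetical per-question confidences and correctness for one exemplar set.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])

# Hypothetical (accuracy, -ECE) values for four exemplar sets.
Y = np.array([[0.60, -0.05], [0.70, -0.10], [0.50, -0.15], [0.65, -0.12]])
front = pareto_mask(Y)
```

In this toy data the first two exemplar sets are mutually non-dominated (one is better calibrated, the other more accurate), while the last two are each dominated by one of them.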

Paper Structure

This paper contains 26 sections, 12 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Optimizing for accuracy and calibration error leads to better reliability (top). Two sides of the same coin for self-consistency sampling (bottom).
  • Figure 2: The BO loop with a single-task GP for each objective and multi-objective acquisition function.
  • Figure 3: Illustration of Hypervolume improvement acquisition function for a candidate point (Section \ref{sec:af_def}).
  • Figure 4: The reliability diagram of test accuracy and ECE for the Math task from MMLU-Pro. It highlights the necessity of optimizing for calibration error, which is paramount for deploying trustworthy LLM systems, in order to minimize over-confident wrong predictions and under-confident right predictions. Compared to online retrieval systems for exemplar search, COM-BOM is more cost-effective at inference time due to its offline search.
  • Figure 5: Evolution of best observed hypervolume on the validation data across STEM, Medical and Humanity tasks for optimization approaches. The hypervolume is measured against the reference point (accuracy=0%, ECE=100%). The evolution plots the average of three runs. Please see App. \ref{sec:more_results} for results on the rest of the tasks.
  • ...and 4 more figures
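
The hypervolume quantities named in the captions for Figures 3 and 5 can be sketched for the two-objective case. The example below works in maximization space, so the reference point (accuracy=0%, ECE=100%) becomes `(0.0, -1.0)` on the (accuracy, negated ECE) axes. It computes a deterministic hypervolume and hypervolume improvement only; NEHVI additionally takes an expectation over the surrogate's posterior, which is not shown here. Function names and the example points are illustrative assumptions.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`; both objectives maximized."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = [(a, b) for a, b in points if a > ref[0] and b > ref[1]]
    pts.sort(key=lambda p: p[0], reverse=True)  # sweep from the highest first objective
    hv, best_b = 0.0, ref[1]
    for a, b in pts:
        if b > best_b:
            # Add the new horizontal strip this point contributes.
            hv += (a - ref[0]) * (b - best_b)
            best_b = b
    return hv

def hypervolume_improvement(candidate, points, ref):
    """Gain in hypervolume from adding `candidate` to the observed `points`."""
    return hypervolume_2d(list(points) + [candidate], ref) - hypervolume_2d(points, ref)

ref = (0.0, -1.0)  # accuracy = 0, ECE = 1.0 (100%), with ECE negated for maximization
observed = [(0.70, -0.10), (0.60, -0.05), (0.50, -0.15)]  # hypothetical evaluations
hv = hypervolume_2d(observed, ref)
hvi = hypervolume_improvement((0.65, -0.02), observed, ref)
```

A candidate that is dominated by an already observed point yields zero improvement, so this quantity rewards only exemplar sets that push the front outward.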