Functional multi-armed bandit and the best function identification problems

Yuriy Dorn, Aleksandr Katrutsa, Ilgam Latypov, Anastasiia Soboleva

TL;DR

This work introduces the functional multi-armed bandit (FMAB) and best function identification (BFI) problems as extensions of the multi-armed bandit (MAB) and best arm identification problems, respectively, in which each arm is an unknown black-box function. It proposes F-LCB, a UCB-style reduction that uses the known convergence rates of base optimization algorithms to build lower confidence bounds and select arms, with regret guarantees in deterministic and stochastic settings. Theoretical results include regret bounds for both FMAB and deterministic BFI across several function classes. The scheme is validated empirically on synthetic smooth and nonsmooth problems and on a CIFAR10 neural-network architecture selection task. The framework enables principled, online function-level optimization and model selection with reduced evaluation cost, with applications in competitive LLM training and context-aware advertising. Potential extensions include stochastic lower bounds, duality-based refinements, and non-convex settings for large-scale training scenarios.

Abstract

Bandit optimization usually refers to the class of online optimization problems with limited feedback: a decision maker uses only the objective value at the current point to make a new decision and does not have access to the gradient of the objective function. While this name accurately captures the limitation in feedback, it is somewhat misleading, since it has no connection with the multi-armed bandit (MAB) problem class. We propose two new classes of problems: the functional multi-armed bandit problem (FMAB) and the best function identification problem (BFI). They are modifications of the multi-armed bandit problem and the best arm identification problem, respectively, where each arm represents an unknown black-box function. These problem classes are a surprisingly good fit for modeling real-world problems such as competitive LLM training. To solve problems from these classes, we propose a new reduction scheme for constructing UCB-type algorithms, namely the F-LCB algorithm, based on algorithms for nonlinear optimization with known convergence rates. We provide regret upper bounds for this reduction scheme in terms of the base algorithms' convergence rates. We also report numerical experiments that demonstrate the performance of the proposed scheme.
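The reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' Algorithm 1, whose exact form is not reproduced on this page, and it rests on several assumptions: the setting is deterministic, each arm is a one-dimensional smooth convex function, and the base algorithm is gradient descent with step size $1/L$, for which the classical rate $f(x_n) - f^* \le L R^2 / (2n)$ holds when $R \ge |x_0 - x^*|$. Subtracting this rate from the current function value gives a lower confidence bound on the arm's minimal value, and the arm with the smallest bound is advanced by one optimization step. All names (`make_quadratic`, `build_demo_arms`, `f_lcb`) and the constants `L`, `R`, `x0` are hypothetical.

```python
import math


def make_quadratic(a, c, b):
    """Return f(x) = a * (x - c)**2 + b and its derivative (convex for a > 0)."""
    def f(x):
        return a * (x - c) ** 2 + b

    def grad(x):
        return 2.0 * a * (x - c)

    return f, grad


def build_demo_arms():
    """Three hypothetical arms whose minimal values are 0, 1 and 2."""
    arms = []
    for a, c, b in [(1.0, 3.0, 0.0), (1.0, -2.0, 1.0), (1.0, 1.0, 2.0)]:
        f, grad = make_quadratic(a, c, b)
        # Assumed constants: L = 4 over-estimates the true smoothness 2 * a = 2,
        # and R = 4 over-estimates the distance |x0 - c| to the minimizer.
        arms.append({"f": f, "grad": grad, "L": 4.0, "R": 4.0, "x0": 0.0})
    return arms


def f_lcb(arms, T):
    """Run T rounds; each round advances one arm by a single gradient step."""
    K = len(arms)
    x = [arm["x0"] for arm in arms]   # current iterate of each arm
    counts = [0] * K                  # optimization steps spent on each arm
    values = [math.inf] * K           # last observed function value per arm
    pulls = []
    for _ in range(T):
        lcb = []
        for k, arm in enumerate(arms):
            if counts[k] == 0:
                # An arm that was never pulled gets -inf, so it is tried first.
                lcb.append(-math.inf)
            else:
                # Gradient-descent rate for convex L-smooth functions, step 1/L.
                rate = arm["L"] * arm["R"] ** 2 / (2.0 * counts[k])
                lcb.append(values[k] - rate)
        # Pick the arm with the smallest lower confidence bound
        # (ties go to the lowest index).
        k = min(range(K), key=lambda i: lcb[i])
        arm = arms[k]
        x[k] -= arm["grad"](x[k]) / arm["L"]
        counts[k] += 1
        values[k] = arm["f"](x[k])
        pulls.append(k)
    return pulls, counts, values


if __name__ == "__main__":
    pulls, counts, values = f_lcb(build_demo_arms(), T=200)
    print("steps per arm:", counts)
    print("last observed values:", values)
```

In this toy instance the bound of a suboptimal arm eventually exceeds the bound of the best arm, so most of the budget goes to the function with the smallest minimum. With a different base optimizer, only the `rate` expression and the update step would change.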

Paper Structure

This paper contains 30 sections, 4 theorems, 34 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Consider a deterministic FMAB problem. Then the following regret bound holds for Algorithm 1 (F-LCB) for all $\tau \in \overline{1, T}$:

Figures (4)

  • Figure 1: Cumulative regret (left), convergence rate (center), and function values (right) versus the number of iterations of the F-LCB algorithm in the FMAB setup with smooth convex functions (\ref{eq:exp_smooth}). Regret stops growing after some iterations, after which the algorithm minimizes only the function with the smallest minimal value.
  • Figure 2: Cumulative regret (left), convergence rate (center), and function values (right) versus the number of iterations of F-LCB in the FMAB setup with nonsmooth convex functions (\ref{exp::nonsmooth_func}). The function with the smallest minimal value is identified, and its minimum is attained.
  • Figure 3: Cumulative regret (left), convergence rate (center), and function values (right) versus the number of iterations of the F-LCB algorithm in the FMAB setup for smooth convex functions with inexact oracles. The function with the smallest minimum is found automatically, even with a poor initial guess.
  • Figure 4: Test accuracy and loss of the considered models. The shaded area shows the $[0.1, 0.9]$ quantile range over 10 runs. ResNet18 and VGG are the most efficient models for the considered task, which is consistent with previous studies.

Theorems & Definitions (10)

  • Definition 1
  • Remark 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Lemma 1
  • proof