Exploiting Concavity Information in Gaussian Process Contextual Bandit Optimization

Kevin Li, Eric Laber

TL;DR

This work introduces Concave Spline Gaussian Processes (CSGP), which exploit concavity of the mean reward in the action within contextual bandits. Encoding concavity through a spline-based, shape-constrained representation makes the posterior a truncated multivariate normal (MVN), which supports a UCB policy with regret bounds that scale with the information gain of the constrained model. Empirical results on synthetic tests and Warfarin dosing show substantial improvements over unconstrained GP and neural baselines, particularly when diminishing returns are pronounced. The approach offers a principled, data-efficient tool for dose-response and pricing problems in which the reward increases with the action up to a point and then declines.

Abstract

The contextual bandit framework is widely used to solve sequential optimization problems where the reward of each decision depends on auxiliary context variables. In settings such as medicine, business, and engineering, the decision maker often possesses additional structural information on the generative model that can potentially be used to improve the efficiency of bandit algorithms. We consider settings in which the mean reward is known to be a concave function of the action for each fixed context. Examples include patient-specific dose-response curves in medicine and expected profit in online advertising auctions. We propose a contextual bandit algorithm that accelerates optimization by conditioning the posterior of a Bayesian Gaussian Process model on this concavity information. We design a novel shape-constrained reward function estimator using a specially chosen regression spline basis and a constrained Gaussian Process posterior. Using this model, we propose a UCB algorithm and derive corresponding regret bounds. We evaluate our algorithm on numerical examples and test functions used to study optimal dosing of anti-clotting medication.
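To make the idea of conditioning a Gaussian posterior on concavity concrete, here is a minimal sketch for a single fixed context. It is not the paper's estimator: it assumes a piecewise-linear representation on equally spaced knots (where concavity reduces to non-positive second differences of the knot values), uses plain rejection sampling as a stand-in for a proper truncated-MVN sampler, and forms an illustrative upper confidence bound from a sample quantile. The function names, the toy mean and covariance, and the 0.95 quantile level are all hypothetical choices made for illustration.

```python
import numpy as np

def second_difference_matrix(m):
    """Rows compute c[i] - 2*c[i+1] + c[i+2] for m equally spaced knots."""
    D = np.zeros((m - 2, m))
    for i in range(m - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def sample_concave_posterior(mean, cov, n_samples, rng, max_tries=200_000):
    """Rejection-sample N(mean, cov) restricted to {c : D c <= 0}.

    Assumption: the unconstrained posterior over knot values is Gaussian.
    Keeping only draws with non-positive second differences yields draws
    from the corresponding truncated multivariate normal.
    """
    D = second_difference_matrix(len(mean))
    kept, tries = [], 0
    while len(kept) < n_samples and tries < max_tries:
        draws = rng.multivariate_normal(mean, cov, size=1000)
        tries += 1000
        concave = np.all(draws @ D.T <= 0.0, axis=1)
        kept.extend(draws[concave])
    return np.array(kept[:n_samples])

# Toy unconstrained posterior over reward values at 6 candidate actions.
rng = np.random.default_rng(0)
knots = np.linspace(0.0, 1.0, 6)
mean = -4.0 * (knots - 0.5) ** 2      # concave toy mean (hypothetical)
cov = 0.1 ** 2 * np.eye(len(knots))   # independent noise (hypothetical)

samples = sample_concave_posterior(mean, cov, n_samples=500, rng=rng)

# Illustrative UCB: upper quantile of the constrained samples per action.
ucb = np.quantile(samples, 0.95, axis=0)
best = int(np.argmax(ucb))
print("chosen action:", knots[best])
```

Rejection sampling is only practical here because the toy mean is already concave and the noise is small, so most draws are accepted. In higher dimensions or with a weakly informative posterior, a dedicated truncated-MVN sampler would be needed.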

Paper Structure

This paper contains 32 sections, 16 theorems, 58 equations, 3 figures.

Key Result

Lemma 1

Suppose that we observe $\bm{y}_{t-1} = \left(y_1, \ldots, y_{t-1}\right)^T$. Define $\boldsymbol{\beta}_{t} := \left\lbrace \boldsymbol{\beta}(\bm{x}_1), \dots, \boldsymbol{\beta}(\bm{x}_{t}) \right\rbrace^T$ to be the concatenated vector of coefficients from every context up to time $t$. Let …

Figures (3)

  • Figure 1: Illustrations of conditioning on concavity information.
  • Figure 2: Cumulative regret for numerical simulation study. The proposed methods, CSGP-Thompson and CSGP-UCB, obtain markedly lower regret than competing methods across a range of problem dimensions and scalings.
  • Figure 3: Cumulative regret for the Warfarin dosing test function experiment. The proposed methods, CSGP-Thompson and CSGP-UCB, produced lower cumulative regret than competing methods.

Theorems & Definitions (29)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Proposition 1
  • proof
  • Lemma 4
  • proof
  • proof
  • proof
  • Lemma 5
  • ...and 19 more