A Control Theory inspired Exploration Method for a Linear Bandit driven by a Linear Gaussian Dynamical System

Jonathan Gornet, Yilin Mo, Bruno Sinopoli

TL;DR

The paper addresses the exploration-exploitation trade-off in a linear bandit where rewards arise from a known Linear Gaussian Dynamical System (LGDS). It analyzes the computational intractability of optimal control and derives regret lower bounds, motivating two practical algorithms: Kalman-UCB and IDEA. Kalman-UCB blends reward prediction with confidence-based perturbations, while IDEA prioritizes actions that minimize Kalman-filter state error. A metric based on observability and perturbation distributions predicts their relative performance. Numerical results across diverse LGDS settings show IDEA often achieving lower regret, particularly in environments with informative feedback dynamics, suggesting a practical advantage for LGDS-driven hyperparameter optimization and similar tasks.

Abstract

The paper introduces a linear bandit environment where the reward is the output of a known Linear Gaussian Dynamical System (LGDS). In this environment, we address the fundamental challenge of balancing exploration -- gathering information about the environment -- and exploitation -- selecting the action with the highest predicted reward. We propose two algorithms, Kalman filter Upper Confidence Bound (Kalman-UCB) and Information filter Directed Exploration Action-selection (IDEA). Kalman-UCB uses the principle of optimism in the face of uncertainty. IDEA selects actions that maximize the combination of the predicted reward and a term that quantifies how much an action minimizes the error of the Kalman filter state prediction, which depends on the LGDS property called observability. IDEA is motivated by applications such as hyperparameter optimization in machine learning. A major problem encountered in hyperparameter optimization is the large action space, which hinders the performance of methods inspired by the principle of optimism in the face of uncertainty, as they need to explore each action to lower reward prediction uncertainty. To predict whether Kalman-UCB or IDEA will perform better, a metric based on the LGDS properties is provided. This metric is validated with numerical results across a variety of randomly generated environments.
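To make the optimism principle concrete, the following is a minimal sketch of one round of a generic Kalman-filter-based UCB rule. It is not the paper's exact algorithm. It assumes a state model x_{t+1} = A x_t + w_t with process-noise covariance Q, and a scalar reward y_t = <c, x_t> + v_t with noise variance r, where c is the vector of the chosen action. The names `A`, `Q`, `r`, `alpha`, `actions`, and `reward_fn` are hypothetical and introduced only for illustration. The bonus term (`alpha` times the predictive standard deviation) is a common UCB form, assumed here rather than taken from the paper.

```python
import numpy as np

def kalman_ucb_step(A, Q, r, actions, x_hat, P, alpha, reward_fn):
    """One round: choose an action optimistically, observe, update the filter.

    All argument names are illustrative assumptions, not the paper's notation.
    actions: array of shape (num_actions, state_dim), one action vector per row.
    x_hat, P: predicted state mean and (symmetric) error covariance.
    """
    # Optimistic score: predicted reward plus a scaled predictive std. dev.
    means = actions @ x_hat
    stds = np.sqrt(np.einsum("ij,jk,ik->i", actions, P, actions))
    k = int(np.argmax(means + alpha * stds))
    c = actions[k]

    # Observe a scalar reward for the chosen action vector.
    y = reward_fn(c)

    # Measurement update for a scalar observation y = <c, x> + noise.
    s = c @ P @ c + r                  # innovation variance
    gain = P @ c / s                   # Kalman gain
    x_post = x_hat + gain * (y - c @ x_hat)
    P_post = P - np.outer(gain, c @ P)

    # Time update through the known dynamics.
    x_next = A @ x_post
    P_next = A @ P_post @ A.T + Q
    return k, y, x_next, P_next
```

With `alpha = 0` the rule is purely greedy on the predicted reward. Larger values of `alpha` favor actions whose reward prediction is still uncertain, which is the behavior the abstract attributes to optimism-based methods and the reason they must try many actions when the action space is large.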

Paper Structure

This paper contains 18 sections, 10 theorems, 83 equations, 3 figures, 2 tables, 5 algorithms.

Key Result

Lemma 1

The following facts are true for the Kalman filter (eq:Kalman_Filter):

Figures (3)

  • Figure 1: Scatter plot of the normalized regret values of Kalman-UCB versus IDEA. Note that the normalized regret value is the percentage of each algorithm's regret with respect to the Kalman Oracle Action-selection's regret.
  • Figure 2: Scatter plot of Kalman-UCB's and IDEA's intervals (eq:interval_performance). Blue dots are the upper bounds and red dots are the lower bounds.
  • Figure 3: Box plot of each method's normalized regret. Each subplot is a different perturbation magnitude level.

Theorems & Definitions (21)

  • Remark 1
  • Remark 2
  • Lemma 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Lemma 2
  • proof
  • Theorem 3
  • ...and 11 more