Bandit Optimal Transport

Lorenzo Croissant

TL;DR

This work introduces Bandit Optimal Transport (BOT), a framework for online, bandit-feedback learning of OT problems with known marginals and unknown cost. It develops a phase-space representation that recasts measure-valued actions into a Hilbert-space linear bandit, and combines infinite-dimensional least-squares estimation with entropy-regularized optimism to derive regret bounds. The resulting regret rates interpolate between $\tilde{\mathcal{O}}(\sqrt{T})$ and $\tilde{\mathcal{O}}(T)$, depending on the regularity of the true cost function $c^*$. The approach leverages intrinsic OT regularity to control infinite-dimensional quantities, and offers a path toward practical online OT learning, with possible extensions to Monge-type problems and RKHS-based refinements. It targets sequential decision-making settings in which transport costs are learned over time.

Abstract

Despite the impressive progress in statistical Optimal Transport (OT) in recent years, there has been little interest in the study of the sequential learning of OT. Surprisingly so, as this problem is both practically motivated and a challenging extension of existing settings such as linear bandits. This article considers (for the first time) the stochastic bandit problem of learning to solve generic Kantorovich and entropic OT problems from repeated interactions when the marginals are known but the cost is unknown. We provide $\tilde{\mathcal O}(\sqrt{T})$ regret algorithms for both problems by extending linear bandits on Hilbert spaces. These results provide a reduction to infinite-dimensional linear bandits. To deal with the dimension, we provide a method to exploit the intrinsic regularity of the cost to learn, yielding corresponding regret bounds which interpolate between $\tilde{\mathcal O}(\sqrt{T})$ and $\tilde{\mathcal O}(T)$.
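
The reduction to linear bandits mentioned in the abstract rests on the fact that the transport cost $\langle c, \pi \rangle$ is linear in the plan $\pi$. The following Python sketch illustrates this in a finite-dimensional analogue only. It assumes discrete uniform marginals, Gaussian observation noise, and purely random exploration, so it is not the paper's optimistic algorithm; the names `sinkhorn_plan`, `ridge_cost_estimate`, and `run_demo` are hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, mu, nu, eps, n_iter=500):
    """Entropic OT plan between discrete marginals mu and nu (Sinkhorn iterations)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (kernel.T @ u)
        u = mu / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def ridge_cost_estimate(plans, losses, lam):
    """Regularised least-squares estimate of the flattened cost matrix."""
    X = np.stack([p.ravel() for p in plans])       # one played plan per row
    gram = lam * np.eye(X.shape[1]) + X.T @ X       # regularised Gram matrix
    return np.linalg.solve(gram, X.T @ np.asarray(losses))


def run_demo(T=500, n=3, eps=0.5, noise=0.01, lam=1e-4, seed=0):
    """Play T exploratory plans, then predict the cost of a held-out plan."""
    rng = np.random.default_rng(seed)
    mu = np.full(n, 1.0 / n)                        # known marginals (assumed uniform)
    nu = np.full(n, 1.0 / n)
    true_cost = rng.uniform(0.0, 1.0, size=(n, n))  # unknown to the learner

    plans, losses = [], []
    for _ in range(T):
        # Pure exploration: a feasible plan obtained from a random surrogate cost.
        plan = sinkhorn_plan(rng.uniform(0.0, 2.0, size=(n, n)), mu, nu, eps)
        plans.append(plan)
        # Bandit feedback: only the noisy scalar <c*, pi_t> is observed.
        losses.append(np.sum(true_cost * plan) + noise * rng.standard_normal())

    cost_hat = ridge_cost_estimate(plans, losses, lam)
    test_plan = sinkhorn_plan(rng.uniform(0.0, 2.0, size=(n, n)), mu, nu, eps)
    predicted = float(cost_hat @ test_plan.ravel())
    actual = float(np.sum(true_cost * test_plan))
    return predicted, actual, test_plan, mu, nu
```

Calling `run_demo()` returns the predicted and actual transport cost of a held-out plan. The cost matrix itself is only identifiable up to terms of the form $a_i + b_j$, since these contribute a constant on the set of plans with fixed marginals, but the predicted cost of any feasible plan does not depend on that ambiguity.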

Paper Structure

This paper contains 51 sections, 26 theorems, 125 equations, 2 algorithms.

Key Result

Theorem 5.1

Under the assumptions labelled "asmp: L2 case" and "asmp: estimate + subG", for any $\varepsilon>0$, $\delta>0$, $\lambda>0$, and $T\in\mathbb{N}$, the regret of the algorithm labelled "alg: alg shared", run with ${(\varepsilon_t)}_{t\in\mathbb{N}}={(\varepsilon)}_{t\in\mathbb{N}}$ and denoted by $\mathcal{A}$, satisfies a regret bound [display equation missing from source] with probability at least $1-\delta$. Note that $M_T$ (thus also $\beta_T(\delta)$) depends implicitly on $\varepsilon$.

Theorems & Definitions (46)

  • Theorem 5.1
  • Theorem 5.2
  • Corollary 5.2 (cor: regret for fixed approximation order with bounded basis)
  • Corollary 5.3 (thm: regret for varying approximation)
  • Definition 1
  • Theorem B.1: constantin_fourier_2016
  • Theorem B.2: constantin_fourier_2016
  • Lemma B.3
  • proof
  • Lemma B.4
  • ...and 36 more