Low-rank Matrix Bandits with Heavy-tailed Rewards

Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

TL;DR

This work studies stochastic low-rank matrix bandits with heavy-tailed rewards: the reward is $y_t=\langle X_t,\Theta^*\rangle + \eta_t$ for a rank-$r$ parameter $\Theta^*$, and the noise is only assumed to satisfy $\mathbb{E}(|\eta_t|^{1+\delta})\le c$ for some $\delta\in(0,1]$. It proposes LOTUS, a batched algorithm that combines a robust nuclear-norm-penalized Huber regression estimator with a LowTO exploitation step after subspace rotation, achieving regret $\tilde{O}(d^{3/2} r^{1/2} T^{1/(1+\delta)}/\tilde{D}_{rr})$ without knowing $T$. The paper also establishes a lower bound of $\Omega(d^{\delta/(1+\delta)} r^{\delta/(1+\delta)} T^{1/(1+\delta)})$, which matches the $T$-dependence of the upper bound up to logarithmic factors, and gives a rank-agnostic variant with regret $\tilde{O}(d r^{3/2} T^{(1+\delta)/(1+2\delta)})$. Experiments complement the theory, showing that LOTUS is more robust to heavy-tailed noise than baselines such as LowESTR.
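To illustrate why a Huber-type loss helps under heavy-tailed noise, the following minimal sketch evaluates the data-fit term of a Huber regression on the reward model $y_t=\langle X_t,\Theta^*\rangle+\eta_t$. It assumes the standard Huber loss with threshold `tau`; the function names and the toy data are hypothetical, and the nuclear-norm penalty and the optimization itself are omitted, so this is not the paper's estimator.

```python
def huber_loss(residual, tau):
    """Standard Huber loss: quadratic for |residual| <= tau, linear beyond."""
    a = abs(residual)
    if a <= tau:
        return 0.5 * a * a
    return tau * a - 0.5 * tau * tau


def inner(X, Theta):
    """Frobenius inner product <X, Theta> for matrices stored as lists of rows."""
    return sum(x * t for row_x, row_t in zip(X, Theta) for x, t in zip(row_x, row_t))


def huber_data_fit(Xs, ys, Theta, tau):
    """Average Huber loss of the residuals y_i - <X_i, Theta>."""
    return sum(huber_loss(y - inner(X, Theta), tau) for X, y in zip(Xs, ys)) / len(ys)


# Toy data (hypothetical): a rank-1 parameter and two arms.
Theta = [[1.0, 0.0], [0.0, 0.0]]
Xs = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
ys = [1.0, 10.0]  # the second reward carries one large noise draw

robust_fit = huber_data_fit(Xs, ys, Theta, tau=1.0)
squared_fit = sum(0.5 * (y - inner(X, Theta)) ** 2 for X, y in zip(Xs, ys)) / len(ys)
print(robust_fit, squared_fit)
```

The single outlier contributes linearly rather than quadratically to the Huber fit, which is the property a robust estimator relies on when only a $(1+\delta)$-th moment of the noise is bounded.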

Abstract

In the stochastic low-rank matrix bandit, the expected reward of an arm is equal to the inner product between its feature matrix and some unknown $d_1$ by $d_2$ low-rank parameter matrix $Θ^*$ with rank $r \ll d_1\wedge d_2$. While all prior studies assume the payoffs are mixed with sub-Gaussian noises, in this work we loosen this strict assumption and consider the new problem of \underline{low}-rank matrix bandit with \underline{h}eavy-\underline{t}ailed \underline{r}ewards (LowHTR), where the rewards only have a finite $(1+δ)$-th moment for some $δ\in (0,1]$. By utilizing the truncation on observed payoffs and the dynamic exploration, we propose a novel algorithm called LOTUS attaining a regret bound of order $\tilde O(d^\frac{3}{2}r^\frac{1}{2}T^\frac{1}{1+δ}/\tilde{D}_{rr})$ without knowing $T$, which matches the state-of-the-art regret bound under sub-Gaussian noises~\citep{lu2021low,kang2022efficient} with $δ= 1$. Moreover, we establish a lower bound of the order $Ω(d^\frac{δ}{1+δ} r^\frac{δ}{1+δ} T^\frac{1}{1+δ}) = Ω(T^\frac{1}{1+δ})$ for LowHTR, which indicates that LOTUS is nearly optimal in the order of $T$. In addition, we improve LOTUS so that it does not require knowledge of the rank $r$, with an $\tilde O(dr^\frac{3}{2}T^\frac{1+δ}{1+2δ})$ regret bound, and it is efficient under the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm.
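As a quick check of how the stated rates depend on $δ$, the sketch below computes the exponents of $T$ in the two regret bounds quoted above, $1/(1+δ)$ and $(1+δ)/(1+2δ)$. The function name and dictionary keys are hypothetical; only the two exponent formulas come from the abstract.

```python
from fractions import Fraction


def regret_exponents(delta):
    """Exponents of T in the regret bounds quoted in the abstract.

    delta is the moment parameter, assumed to lie in (0, 1].
    """
    d = Fraction(delta)
    if not (0 < d <= 1):
        raise ValueError("delta must lie in (0, 1]")
    return {
        "known_rank": 1 / (1 + d),               # T^{1/(1+delta)}
        "rank_agnostic": (1 + d) / (1 + 2 * d),  # T^{(1+delta)/(1+2*delta)}
    }


for delta in (Fraction(1, 4), Fraction(1, 2), 1):
    print(delta, regret_exponents(delta))
```

With $δ=1$ the known-rank exponent is $1/2$, the familiar $\sqrt{T}$ rate of the sub-Gaussian setting, while the rank-agnostic exponent is $2/3$; both exponents grow toward $1$ as $δ$ shrinks toward $0$.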

Paper Structure (24 sections, 13 theorems, 130 equations, 1 figure, 1 algorithm)

This paper contains 24 sections, 13 theorems, 130 equations, 1 figure, 1 algorithm.

Key Result

Theorem 4.1

By extending Assumption assu:subg with any order of $\sigma$ and $c_l$, with probability at least $1-\epsilon$, the low-rank estimator $\widehat{\Theta}$ in Eqn. eq:estimator with $\tau \asymp \left(n/(d+ \ln{(1/\epsilon)})\right)^{\frac{1}{1+\delta}} c^{\frac{1}{1+\delta}}$ and $\lambda \asymp$ […] for some constant $C_1$, as long as we have $n \gtrsim dr\nu^3, d, \nu^2$, and $(d-\ln{(\epsilon)})$ […]
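The scaling of the threshold $\tau$ in Theorem 4.1 can be evaluated numerically. The sketch below computes $\left(n c/(d+\ln(1/\epsilon))\right)^{1/(1+\delta)}$, which equals the stated expression up to the constant hidden by $\asymp$; the function name and the `scale` argument standing in for that unspecified constant are assumptions, not values from the paper.

```python
import math


def huber_threshold_scale(n, d, eps, delta, c, scale=1.0):
    """Order of the threshold tau in Theorem 4.1, up to an unspecified constant.

    n: sample size, d: dimension, eps: failure probability in (0, 1),
    delta: moment parameter in (0, 1], c: bound on the (1+delta)-th moment.
    scale is a hypothetical stand-in for the constant hidden by the asymptotic notation.
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    if not (0.0 < delta <= 1.0):
        raise ValueError("delta must lie in (0, 1]")
    return scale * (n * c / (d + math.log(1.0 / eps))) ** (1.0 / (1.0 + delta))


# The threshold grows with the sample size, so fewer observations are clipped.
for n in (100, 1000, 10000):
    print(n, huber_threshold_scale(n, d=10, eps=0.01, delta=0.5, c=1.0))
```

A larger $n$ or a larger moment bound $c$ raises the threshold, while a larger dimension or a smaller failure probability $\epsilon$ lowers it.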

Figures (1)

  • Figure 1: Plots of cumulative regrets of LowESTR and our proposed LOTUS with fixed or changing contextual arm set under t-distribution, Pareto, and Laplace heavy-tailed noise. We use the LOTUS algorithm agnostic to $r$ in the first three experiments displayed in the first row, and we utilize the value of $r$ in LOTUS in experiments shown in the second row.

Theorems & Definitions (14)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 5.1
  • Lemma A.1
  • Corollary A.2
  • Definition A.3
  • Theorem A.4
  • Theorem A.5
  • ...and 4 more