Low-rank Matrix Bandits with Heavy-tailed Rewards
Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee
TL;DR
This work addresses stochastic low-rank matrix bandits with heavy-tailed rewards: the reward is $y_t=\langle X_t,\Theta^*\rangle + \eta_t$ for a rank-$r$ parameter $\Theta^*$, and the noise only satisfies $\mathbb{E}(|\eta_t|^{1+\delta})\le c$ for some $\delta\in(0,1]$. It proposes LOTUS, a batched algorithm that combines a robust nuclear-norm Huber regression estimator with a LowTO exploitation step after subspace rotation, achieving regret $\tilde{O}(d^{3/2} r^{1/2} T^{1/(1+\delta)}/\tilde{D}_{rr})$ without knowing $T$. The paper also proves a lower bound of $\Omega(d^{\delta/(1+\delta)} r^{\delta/(1+\delta)} T^{1/(1+\delta)})$, so LOTUS is nearly optimal in its dependence on $T$, up to logarithmic factors. A rank-agnostic variant attains a regret bound of $\tilde{O}(d r^{3/2} T^{(1+\delta)/(1+2\delta)})$ without knowledge of $r$. Experiments complement the theory, showing that LOTUS is more robust to heavier-tailed noise than baselines such as LowESTR.
Abstract
In the stochastic low-rank matrix bandit problem, the expected reward of an arm equals the inner product between its feature matrix and an unknown $d_1 \times d_2$ low-rank parameter matrix $\Theta^*$ with rank $r \ll d_1\wedge d_2$. While all prior studies assume that the payoffs are corrupted by sub-Gaussian noise, in this work we relax this strict assumption and consider the new problem of \underline{low}-rank matrix bandit with \underline{h}eavy-\underline{t}ailed \underline{r}ewards (LowHTR), where the rewards only have a finite $(1+\delta)$-th moment for some $\delta\in (0,1]$. By combining truncation of the observed payoffs with dynamic exploration, we propose a novel algorithm called LOTUS that attains a regret bound of order $\tilde O(d^{\frac{3}{2}}r^{\frac{1}{2}}T^{\frac{1}{1+\delta}}/\tilde{D}_{rr})$ without knowing $T$, which matches the state-of-the-art regret bound under sub-Gaussian noise~\citep{lu2021low,kang2022efficient} when $\delta = 1$. Moreover, we establish a lower bound of order $\Omega(d^{\frac{\delta}{1+\delta}} r^{\frac{\delta}{1+\delta}} T^{\frac{1}{1+\delta}}) = \Omega(T^{\frac{1}{1+\delta}})$ for LowHTR, which indicates that LOTUS is nearly optimal in the order of $T$. In addition, we improve LOTUS so that it does not require knowledge of the rank $r$, with a regret bound of $\tilde O(dr^{\frac{3}{2}}T^{\frac{1+\delta}{1+2\delta}})$, and it remains efficient in the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm.
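To make the LowHTR reward model concrete, the following is a minimal Python sketch that simulates payoffs $y_t=\langle X_t,\Theta^*\rangle+\eta_t$ with heavy-tailed noise and applies a simple payoff truncation. It is an illustration under stated assumptions, not the LOTUS algorithm itself: the Student-t noise, the dimensions, the fixed threshold, and the helper names `make_low_rank_parameter`, `pull_arm`, and `truncate` are all hypothetical choices made for this example, and the zero-out truncation shown is a generic device from the heavy-tailed bandit literature rather than the paper's exact procedure.

```python
import numpy as np


def make_low_rank_parameter(d1, d2, r, rng):
    """Draw a random d1 x d2 matrix of rank r with unit Frobenius norm."""
    U = rng.standard_normal((d1, r))
    V = rng.standard_normal((d2, r))
    theta = U @ V.T
    return theta / np.linalg.norm(theta)


def pull_arm(X, theta, df, rng):
    """Return y = <X, theta> + eta, where eta is Student-t noise with df degrees of freedom."""
    return float(np.sum(X * theta) + rng.standard_t(df))


def truncate(y, threshold):
    """Set payoffs whose magnitude exceeds the threshold to zero; keep the rest."""
    y = np.asarray(y, dtype=float)
    return np.where(np.abs(y) <= threshold, y, 0.0)


rng = np.random.default_rng(0)
d1, d2, r = 8, 6, 2   # assumed problem sizes, chosen only for illustration
delta = 0.5           # assumed moment parameter in (0, 1]
df = 1.8              # Student-t with 1 + delta < df < 2: finite (1+delta)-th moment, infinite variance

theta = make_low_rank_parameter(d1, d2, r, rng)

# One fixed arm with unit Frobenius norm, pulled repeatedly.
X = rng.standard_normal((d1, d2))
X /= np.linalg.norm(X)

rewards = np.array([pull_arm(X, theta, df, rng) for _ in range(1000)])
clipped = truncate(rewards, threshold=10.0)  # the threshold value is an arbitrary assumption
```

Comparing `rewards.mean()` with `clipped.mean()` across several seeds gives a feel for how a few extreme payoffs can dominate the raw average when the noise has infinite variance, which is the difficulty that robust estimation is meant to address.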
