Towards More Efficient, Robust, Instance-adaptive, and Generalizable Sequential Decision making

Zhiyong Wang

TL;DR

The primary goal of this Ph.D. study is to develop provably efficient and practical algorithms for data-driven sequential decision-making under uncertainty for both general reinforcement learning and bandits.

Abstract

The primary goal of my Ph.D. study is to develop provably efficient and practical algorithms for data-driven sequential decision-making under uncertainty. My work focuses on reinforcement learning (RL), multi-armed bandits, and their applications, including recommendation systems, computer networks, video analytics, and large language models (LLMs). Sequential decision-making methods, such as bandits and RL, have demonstrated remarkable success, ranging from outperforming human players in complex games like Atari and Go to advancing robotics, recommendation systems, and the fine-tuning of LLMs. Despite these successes, many established algorithms rely on idealized models that can fail under model misspecification or adversarial perturbations, particularly in settings where accurate prior knowledge of the underlying model class is unavailable or where malicious users operate within dynamic systems. These challenges are pervasive in real-world applications, where robust and adaptive solutions are critical. Furthermore, while worst-case guarantees provide theoretical reliability, they often fail to capture instance-dependent performance; exploiting instance-dependent structure can lead to more efficient and practical solutions. Another key challenge lies in generalizing to new, unseen environments, a crucial requirement for deploying these methods in dynamic and unpredictable settings. To address these limitations, my research aims to develop more efficient, robust, instance-adaptive, and generalizable sequential decision-making algorithms for both general reinforcement learning and bandits.

Paper Structure

This paper contains 188 sections, 90 theorems, 531 equations, 13 figures, 4 tables, and 15 algorithms.

Key Result

Lemma 1

For two distributions $f \in \Delta([0,1])$ and $g \in \Delta([0,1])$: where $\mathrm{VaR}_f := \mathbb{E}_{x\sim f} ( x - \mathbb{E}_{x\sim f}[x])^2$ denotes the variance of the distribution $f$.

Figures (13)

  • Figure 1: Two contextual MDPs with the same compliant average MDPs. The discrete context space is defined as $C=\{v,w\}$, and both MDPs satisfy $\mathcal{S}=\{x_1\}$, $\mathcal{A}=\{a_1,a_2,a_3\}$, $H=1$. The data collection distributions $\mu$ and rewards $r$ for each action of each context are specified in the graph.
  • Figure 2: Comparison of RCLUMB and RSCLUMB with the baselines. (a) shows the results on synthetic data, (b) and (c) show the results on the Yelp dataset, and (d) and (e) show the results on the MovieLens dataset. All experiments use $u = 1,000$ users, $m = 10$ clusters, and $d = 50$. All results are averaged over $10$ random trials. The error bars are standard deviations divided by $\sqrt{10}$.
  • Figure 3: Illustration of LOCUD. The unknown user relations are represented by dotted circles, e.g., users 3 and 7 have similar preferences and thus can be in the same user segment (i.e., cluster). Users 6 and 8 are corrupted users with dynamic behaviors over time (e.g., user 8 behaves normally at $t_1$ and $t_3$ (blue) but is adversarially corrupted at $t_2$ and $t_4$ (red)) [lykouris2018stochastic, he2022nearly], making them hard to detect online. The agent needs to learn user relations to utilize information among similar users to speed up learning, and to detect the corrupted users 6 and 8 online from bandit feedback.
  • Figure 4: Algorithm illustrations. Users 6 and 8 are corrupted users (orange), and the others are normal (green). (a) illustrates RCLUB-WCU, which starts with a complete user graph and adaptively deletes edges between users (dashed lines) with dissimilar robustly learned preferences. The corrupted behaviors of users 6 and 8 may cause inaccurate preference estimations, leading to erroneous relation inference. In this case, how to delete edges correctly is non-trivial, and RCLUB-WCU addresses this challenge (detailed in Section \ref{section: rclub-wcu}). (b) illustrates OCCUD at some round $t$, where person icons with triangle hats represent the non-robust user preference estimations. The gap between the non-robust estimation of user 6 and the robust estimation of user 6's inferred cluster (circle $C_1$) exceeds the threshold $r_6$ at this round (Line \ref{detect line} in Algo. \ref{occud}), so OCCUD detects user 6 to be corrupted.
  • Figure 5: Recommendation results on the synthetic and real-world datasets.
  • ...and 8 more figures

Theorems & Definitions (125)

  • Lemma 1: Lemma 4.3 in wang2024more
  • Definition 3.1: $\ell_p$ Eluder Dimension
  • Remark 1
  • Lemma 2: Proposition 19 in liu2022partially
  • Theorem 3.3.1: Main theorem for online setting
  • Corollary 3.1: Horizon-free and First-order regret bound
  • Corollary 3.2: $\log K$ regret bound with deterministic transitions
  • Definition 3.2: Bracketing Number [geer2000empirical]
  • Corollary 3.3: Regret bound for alg:mleonline with infinite model class $\mathcal{P}$
  • Example 1: Tabular MDPs
  • ...and 115 more