Safe Offline Reinforcement Learning with Real-Time Budget Constraints

Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

TL;DR

Trajectory-based REal-time Budget Inference (TREBI) addresses safe offline reinforcement learning under real-time budget constraints. It models the problem from the perspective of trajectory distribution and solves it through diffusion model planning. The paper also proves an error bound on the estimated episodic reward and cost under the offline setting, which provides a performance guarantee for TREBI.

Abstract

Aiming to promote the safe real-world deployment of Reinforcement Learning (RL), research on safe RL has made significant progress in recent years. However, most existing works still focus on the online setting, where risky violations of the safety budget are likely to be incurred during training. Moreover, in many real-world applications, the learned policy is required to respond to dynamically determined safety budgets (i.e., constraint thresholds) in real time. In this paper, we target the above real-time budget constraint problem under the offline setting, and propose Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that models this problem from the perspective of trajectory distribution and solves it through diffusion model planning. Theoretically, we prove an error bound on the estimation of the episodic reward and cost under the offline setting and thus provide a performance guarantee for TREBI. Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.

Paper Structure

This paper contains 38 sections, 9 theorems, 63 equations, 11 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 4.1

If the trajectory distribution of the behavior policy satisfies: then the optimal trajectory distribution for the problem given by eq:offlinerl_oj, eq:offlinerl_safe_cons, and eq:offlinerl_traj_cons exists and takes the following form: where $\alpha$ is a constant depending on $\epsilon$ and $b$, and $Z=\int_{C(\tau)\leq b}p_{\pi_\beta}(\tau)\exp(\alpha R(\tau))\,d\tau$ is a constant normalizer to make sure that $q^*_
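
The normalizer $Z$ above integrates $p_{\pi_\beta}(\tau)\exp(\alpha R(\tau))$ over trajectories with $C(\tau)\leq b$. As a rough illustration only, the sketch below computes a discrete analogue of that normalization over a finite set of trajectories assumed to be sampled from the behavior policy: each trajectory within budget receives a weight proportional to $\exp(\alpha R(\tau))$, and the rest receive zero. The function name `constrained_exp_weights`, the sample values, and the choices of `alpha` and `budget` are hypothetical, and this is not TREBI's planning procedure, which the abstract describes as diffusion model planning.

```python
import math


def constrained_exp_weights(rewards, costs, alpha, budget):
    """Self-normalized weights proportional to exp(alpha * R) on the set C <= budget.

    Hypothetical helper for illustration: `rewards` and `costs` are episodic
    reward and cost values of trajectories assumed to be drawn from the
    behavior policy.
    """
    if len(rewards) != len(costs):
        raise ValueError("rewards and costs must have the same length")

    feasible = [c <= budget for c in costs]
    if not any(feasible):
        raise ValueError("no sampled trajectory satisfies the budget")

    # Shift by the largest feasible reward so exp() cannot overflow;
    # the shift cancels after normalization.
    r_max = max(r for r, ok in zip(rewards, feasible) if ok)
    unnormalized = [
        math.exp(alpha * (r - r_max)) if ok else 0.0
        for r, ok in zip(rewards, feasible)
    ]

    # Discrete stand-in for the normalizer Z (up to the constant shift).
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Toy values, chosen only for illustration.
rewards = [1.0, 2.0, 3.0]
costs = [0.5, 0.2, 1.5]
weights = constrained_exp_weights(rewards, costs, alpha=1.0, budget=1.0)
```

In this toy call the third trajectory exceeds the budget and gets zero weight, while the remaining weight is split between the first two in favor of the higher-reward one. A larger `alpha` concentrates the weight more sharply on high-reward feasible trajectories.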

Figures (11)

  • Figure 1: The results on Pendulum and Reacher with regard to the normalized episodic costs (the top row) and the episodic rewards (the bottom row), under the constraints w.r.t. five different budgets (bars with different colors). The constraint w.r.t. a budget is satisfied only when the value of the normalized cost is less than one. The dashed lines in the cost plot and the reward plot indicate the normalized episodic cost threshold and the expected episodic reward of the converged unconstrained policy trained online, respectively. Note that for the convenience of display, some of the boxes are not completely shown in the figure, e.g., the reward of BCQ-Sauté on Reacher with budget $0.2$.
  • Figure 2: Results on HalfCheetah using three different types of datasets. The black dashed lines and the blue dashed lines in the reward plots indicate the expected episodic rewards of the unconstrained Diffuser and BCQ respectively.
  • Figure 3: The results on SafetyCarCircle and SafetyBallReach. The dashed lines in reward plots indicate the expected episodic reward of the behavior policy in the final iteration of online training.
  • Figure 4: Ablation studies on Pendulum.
  • Figure 5: Results on Hopper tasks.
  • ...and 6 more figures

Theorems & Definitions (17)

  • Theorem 4.1
  • proof
  • Lemma 4.2
  • proof
  • Proposition 4.3
  • proof
  • Theorem 1.1: 4.1 in the main paper
  • Lemma 1.2
  • proof
  • proof
  • ...and 7 more