MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator

Xiao-Yin Liu; Xiao-Hu Zhou; Guotao Li; Hao Li; Mei-Jiang Gui; Tian-Yu Xiang; De-Xing Huang; Zeng-Guang Hou

TL;DR

MICRO tackles offline reinforcement learning under distribution shift by integrating robustness directly into model-based offline RL through a conservative Bellman operator. It uses an adaptive penalty f(s,a) learned from an ensemble of dynamics models to trade off performance and robustness, avoiding continual dynamics-model updates during policy learning. The method provides a dual reformulation to efficiently optimize over the uncertainty set and proves contraction properties with a robust policy-improvement guarantee. Empirically, MICRO achieves state-of-the-art or competitive results on D4RL tasks, with substantially reduced computation time and improved robustness to environment perturbations and certain adversarial attacks, while highlighting areas for further improvement under strong attacks.

Abstract

Offline reinforcement learning (RL) faces a significant challenge from distribution shift. Model-free offline RL tackles this problem by penalizing the Q value for out-of-distribution (OOD) data or by constraining the policy to stay close to the behavior policy, but this inhibits exploration of the OOD region. Model-based offline RL, which uses a trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective approach to this problem. However, current model-based algorithms rarely consider agent robustness when incorporating conservatism into the policy. Therefore, a new model-based offline algorithm with a conservative Bellman operator (MICRO) is proposed. This method trades off performance and robustness by introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO can significantly reduce the computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms on offline RL benchmarks and is considerably robust to adversarial perturbations.
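The idea of "choosing the minimal Q value in the state uncertainty set" can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the uncertainty set is approximated by a finite batch of candidate next states (for example, one prediction per ensemble member), and the names `conservative_target`, `q_fn`, and `policy_fn` are hypothetical. It also omits any blending with a standard Bellman target that the full method may use.

```python
import numpy as np

def conservative_target(reward, next_states, q_fn, policy_fn, gamma=0.99):
    """One-step target that takes the worst case over candidate next states.

    next_states: array of shape (K, state_dim). ASSUMPTION: a finite
    stand-in for the state uncertainty set, e.g. one row per ensemble member.
    q_fn, policy_fn: hypothetical callables for the critic and the policy.
    """
    actions = policy_fn(next_states)        # shape (K, action_dim)
    q_values = q_fn(next_states, actions)   # shape (K,)
    # Pessimism: back up the smallest Q value among the candidates.
    return reward + gamma * np.min(q_values)

# Toy critic and policy, chosen only so the arithmetic is easy to follow.
def q_fn(states, actions):
    return states.sum(axis=1) + actions.sum(axis=1)

def policy_fn(states):
    return np.zeros((states.shape[0], 1))

next_states = np.array([[1.0, 2.0], [0.5, 0.5], [3.0, 0.0]])
target = conservative_target(1.0, next_states, q_fn, policy_fn, gamma=0.9)

# For comparison: a non-robust target that averages over the candidates.
mean_target = 1.0 + 0.9 * np.mean(q_fn(next_states, policy_fn(next_states)))
```

In this toy case the candidate Q values are 3.0, 1.0, and 3.0, so the conservative target backs up 1.0 while the averaged target backs up their mean. The only extra cost over a standard backup is one batched critic evaluation and a `min`, which is consistent with the abstract's claim of reduced computation relative to training an adversarial model.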

Paper Structure (24 sections, 7 theorems, 33 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Problem Formulation
Methodology
Reformulation for Conservative Bellman Operator
Theoretical Analysis
Practical Implementation
Experiments
Benchmark Results (Q1 and Q2)
Environment Parameter Perturbations (Q3)
External Adversarial Attacks (Q3)
Conclusion
Proof of Proposition
Proof of Proposition 2
...and 9 more sections

Key Result

Proposition 1

Let $\mathcal{M}^{\star}$ be the true MDP and $\mathcal{M}_{\varepsilon}$ be the uncertainty set of MDPs. Then, with high probability, $\inf_{\mathcal{M} \in \mathcal{M}_{\varepsilon}}V_{\mathcal{M}}^{\pi}\leq V_{\mathcal{M}^{\star}}^{\pi}$ holds for any policy $\pi$.
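
A short worked argument shows why an inequality of this form holds. It assumes a condition not stated in the proposition as quoted here: that the uncertainty set is constructed to contain the true MDP with probability at least $1-\delta$, where $\delta$ is a hypothetical confidence level introduced only for illustration.

```latex
\mathcal{M}^{\star} \in \mathcal{M}_{\varepsilon}
\;\Longrightarrow\;
\inf_{\mathcal{M} \in \mathcal{M}_{\varepsilon}} V_{\mathcal{M}}^{\pi}
\;\leq\; V_{\mathcal{M}^{\star}}^{\pi}
\quad \text{for every policy } \pi .
```

An infimum over a set is never larger than the value at any single member of that set, so the bound holds whenever the true MDP lies in the uncertainty set. Under the assumption above, that event has probability at least $1-\delta$, which is where the "with high probability" qualifier would come from.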

Figures (5)

Figure 1: Conceptual illustration of model-based offline RL with a conservative Bellman operator.
Figure 2: The performance of MICRO, RAMBO and MOBILE under environment parameter perturbations in the HalfCheetah, Hopper and Walker2d environments. Gravity and friction vary from 0.5 to 5 times and from 0.5 to 1.5 times their values in the simulation environment, respectively.
Figure 3: The performance of MICRO, MOBILE and RAMBO under attack scales in the range [0, 0.2] for different attack types in the Walker2d environment. M, M-R and M-E are abbreviations of Medium, Medium-Replay and Medium-Expert, respectively.
Figure 4: Corresponding learning curves for training. Each figure shows the training curve for a specific task under different datasets.
Figure 5: The performance of MICRO, MOBILE and RAMBO under attack scales in the range [0, 0.2] for different attack types on three Walker2d datasets. M, M-R and M-E are abbreviations of Medium, Medium-Replay and Medium-Expert, respectively.

Theorems & Definitions (9)

Proposition 1: Pessimistic value function
Definition 1
Proposition 2: $\gamma$-contraction mapping operator
Proposition 3
Definition 2: Policy concentrability coefficient
Theorem 1: Robust policy improvement
Lemma 1
Lemma 2
Lemma 3: Lemma 2 in liu2023domain
