
Mildly Conservative Q-Learning for Offline Reinforcement Learning

Jiafei Lyu, Xiaoteng Ma, Xiu Li, Zongqing Lu

TL;DR

This paper proposes Mildly Conservative Q-learning (MCQ), which actively trains OOD actions by assigning them proper pseudo Q values. It shows theoretically that MCQ induces a policy that behaves at least as well as the behavior policy and that no erroneous overestimation occurs for OOD actions.

Abstract

Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.

Paper Structure

This paper contains 22 sections, 10 theorems, 43 equations, 6 figures, 10 tables, 2 algorithms.

Key Result

Proposition 1

In the support region of the behavior policy, i.e., ${\rm Support}(\mu)$, the MCB operator is a $\gamma$-contraction operator in the $\mathcal{L}_\infty$ norm, and any initial $Q$ function can converge to a unique fixed point by repeatedly applying $\mathcal{T}_{\mathrm{MCB}}$.
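
The contraction claim can be checked numerically on a toy problem. The sketch below is illustrative only: this excerpt does not give the definition of $\mathcal{T}_{\mathrm{MCB}}$, so the code assumes that, inside ${\rm Support}(\mu)$, the operator acts as a Bellman backup whose maximum is restricted to in-support actions. The tabular MDP, the deterministic transitions, and all function names are hypothetical choices made for the example.

```python
import random

GAMMA = 0.9  # discount factor (illustrative value)
N_STATES, N_ACTIONS = 4, 3


def make_toy_mdp(seed=0):
    """Build a small deterministic MDP and a per-state list of in-support actions."""
    rng = random.Random(seed)
    # Action 0 is always in the support so every state has at least one such action.
    support = [[0] + [a for a in range(1, N_ACTIONS) if rng.random() < 0.5]
               for _ in range(N_STATES)]
    reward = [[rng.uniform(0.0, 1.0) for _ in range(N_ACTIONS)]
              for _ in range(N_STATES)]
    next_state = [[rng.randrange(N_STATES) for _ in range(N_ACTIONS)]
                  for _ in range(N_STATES)]
    return support, reward, next_state


def random_q(seed, scale=10.0):
    """Arbitrary initial Q table."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(N_ACTIONS)]
            for _ in range(N_STATES)]


def backup_in_support(q, support, reward, next_state):
    """Assumed in-support backup: r + gamma * max over in-support a' of Q(s', a').

    Entries for out-of-support actions are left unchanged.
    """
    new_q = [row[:] for row in q]
    for s in range(N_STATES):
        for a in support[s]:
            s2 = next_state[s][a]
            new_q[s][a] = reward[s][a] + GAMMA * max(q[s2][b] for b in support[s2])
    return new_q


def sup_dist(q1, q2, support):
    """L-infinity distance between two Q tables, measured on in-support pairs only."""
    return max(abs(q1[s][a] - q2[s][a])
               for s in range(N_STATES) for a in support[s])


def contraction_demo(seed=0, steps=200):
    """Return the distance before, after one backup, and after `steps` more backups."""
    support, reward, next_state = make_toy_mdp(seed)
    q1, q2 = random_q(seed + 1), random_q(seed + 2)
    before = sup_dist(q1, q2, support)
    one_step = sup_dist(backup_in_support(q1, support, reward, next_state),
                        backup_in_support(q2, support, reward, next_state),
                        support)
    for _ in range(steps):
        q1 = backup_in_support(q1, support, reward, next_state)
        q2 = backup_in_support(q2, support, reward, next_state)
    return before, one_step, sup_dist(q1, q2, support)


if __name__ == "__main__":
    before, one_step, final = contraction_demo()
    print(f"distance before: {before:.4f}")
    print(f"after one backup: {one_step:.4f} (bound: {GAMMA * before:.4f})")
    print(f"after many backups: {final:.2e}")
```

Under these assumptions, one backup shrinks the in-support distance by at least a factor of $\gamma$, and two different initial tables approach the same in-support values after repeated application, which is the behavior the proposition describes.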

Figures (6)

  • Figure 1: Comparison of prior methods against mild conservatism. The red spots represent the dataset samples. The left figure shows that penalizing OOD actions makes the value function drop sharply at the boundary of the dataset's support, which hinders policy learning. The central figure shows that policy regularization keeps the policy near the behavior policy, leading to poor performance if the behavior policy is unsatisfactory. The right figure illustrates the basic idea of mild conservatism: the estimated values for OOD actions are allowed to be high as long as they do not affect learning of the optimal policy supported by the dataset, i.e., $Q(s,a^{\rm ood})< \max_{a\in{\rm Support}(\mu)}Q(s,a)$.
  • Figure 2: Parameter study and $Q$ function estimation on halfcheetah-medium-v2 and hopper-medium-replay-v2. The shaded region captures the standard deviation.
  • Figure 3: Offline-to-online fine-tuning results on 6 D4RL MuJoCo locomotion tasks.
  • Figure 4: MuJoCo datasets. We conduct experiments on halfcheetah, hopper, and walker2d tasks.
  • Figure 5: Missing value estimation on halfcheetah-medium-v2.
  • ...and 1 more figure
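
The inequality in the Figure 1 caption, $Q(s,a^{\rm ood})< \max_{a\in{\rm Support}(\mu)}Q(s,a)$, can be made concrete with a small tabular sketch. The rule used here (give every OOD action the best in-support value minus a margin `delta`) is an assumption chosen to satisfy the inequality; this excerpt does not specify how MCQ constructs its pseudo Q values, and `pseudo_q_targets` and `delta` are hypothetical names.

```python
def pseudo_q_targets(q_row, in_support, delta=0.5):
    """Return per-action targets for one state.

    In-support actions keep their current value. Every OOD action is assigned
    (best in-support value - delta), so it stays strictly below the in-support maximum.
    """
    if delta <= 0:
        raise ValueError("delta must be positive to keep the inequality strict")
    if not in_support:
        raise ValueError("at least one in-support action is required")
    in_set = set(in_support)
    best = max(q_row[a] for a in in_set)
    return [q_row[a] if a in in_set else best - delta
            for a in range(len(q_row))]


def greedy_action(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda a: values[a])


if __name__ == "__main__":
    # Actions 1 and 3 are OOD and start out badly overestimated.
    q_row = [2.0, 5.0, 0.5, 9.0]
    in_support = [0, 2]
    targets = pseudo_q_targets(q_row, in_support)
    print(targets, greedy_action(targets))
```

With targets built this way, OOD values are not pushed far down, yet the greedy action always lies inside the support, which is the property the caption describes.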

Theorems & Definitions (20)

  • Definition 1
  • Proposition 1
  • Proposition 2: Behave at least as well as behavior policy
  • Proposition 3: Milder Pessimism
  • Definition 2
  • Proposition 4
  • Proposition 5: No erroneous overestimation will occur
  • Definition 3
  • Proposition 6
  • proof
  • ...and 10 more