Doubly Mild Generalization for Offline Reinforcement Learning

Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, Xiangyang Ji

TL;DR

Under certain conditions, mild generalization beyond the dataset can be trusted and leveraged to improve performance. The paper proposes Doubly Mild Generalization (DMG), which comprises mild action generalization and mild generalization propagation.

Abstract

Offline Reinforcement Learning (RL) suffers from extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions. Significant efforts have been devoted to mitigating such generalization, and recent in-sample learning approaches have further succeeded in entirely eschewing it. Nevertheless, we show that mild generalization beyond the dataset can be trusted and leveraged to improve performance under certain conditions. To appropriately exploit generalization in offline RL, we propose Doubly Mild Generalization (DMG), comprising (i) mild action generalization and (ii) mild generalization propagation. The former refers to selecting actions in a close neighborhood of the dataset to maximize the Q values. Even so, potentially erroneous generalization can still be propagated, accumulated, and exacerbated by bootstrapping. In light of this, the latter concept is introduced to mitigate generalization propagation without impeding the propagation of RL learning signals. Theoretically, DMG guarantees better performance than the in-sample optimal policy in the oracle generalization scenario. Even under worst-case generalization, DMG can still control value overestimation at a certain level and lower-bound the performance. Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. Moreover, benefiting from its flexibility in both generalization aspects, DMG enjoys a seamless transition from offline to online learning and attains strong online fine-tuning performance.
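
The idea of mild generalization propagation can be illustrated with a toy bootstrapping target that blends two maxima: one over actions in a close neighborhood of the dataset, and one over dataset actions only. This is a minimal sketch, not the paper's implementation. The function name `blended_target`, the argument names, and the exact blending form are assumptions made for illustration; the abstract does not state the update rule.

```python
def blended_target(reward, gamma, q_next_dataset, q_next_nearby, lam, done=False):
    """Toy one-step target that mixes two bootstrapped maxima.

    q_next_dataset: Q-values at the next state for actions seen in the dataset.
    q_next_nearby:  Q-values at the next state for candidate actions in a close
                    neighborhood of the dataset (hypothetical candidate set).
    lam:            mixing weight in [0, 1] (assumed form). lam = 0 bootstraps
                    only from dataset actions; lam = 1 bootstraps fully from
                    the mildly generalized maximum.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Maximum restricted to actions that appear in the dataset.
    in_sample_max = max(q_next_dataset)

    # Maximum over the neighborhood, which also contains the dataset actions,
    # so it is never smaller than the in-sample maximum.
    mild_max = max(list(q_next_dataset) + list(q_next_nearby))

    # Only a fraction lam of the bootstrapped value relies on generalization.
    bootstrap = lam * mild_max + (1.0 - lam) * in_sample_max
    return reward + (0.0 if done else gamma * bootstrap)


# Example: the nearby action looks better (6.0) than any dataset action (4.0).
for lam in (0.0, 0.5, 1.0):
    print(lam, blended_target(1.0, 0.5, [2.0, 4.0], [6.0], lam))
```

With a small mixing weight, an erroneous Q-value at a nearby action contributes only a fraction of its error to the target, while the reward signal still propagates through the in-sample term.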

Paper Structure

This paper contains 44 sections, 19 theorems, 107 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Under certain continuity conditions, the following equation holds when the learning rate $\alpha$ is sufficiently small and $\tilde{a}$ is sufficiently close to $a$: where $C_1 \in [0,1]$ and $C_2$ is a bounded constant.

Figures (7)

  • Figure 1: Performance and Q values of DMG with varying mixture coefficient $\lambda$ over 5 random seeds. The crosses $\times$ mean that the value functions diverge in several seeds. As $\lambda$ increases, DMG enables stronger generalization propagation, resulting in higher and probably divergent learned Q values. Mild generalization propagation plays a crucial role in achieving strong performance.
  • Figure 2: Performance and Q values of DMG with varying penalty coefficient $\nu$ over 5 random seeds. As $\nu$ decreases, DMG allows broader action generalization, leading to larger learned Q values. Mild action generalization is also critical for attaining superior performance.
  • Figure 3: Runtime of algorithms on halfcheetah-medium-replay-v2 on a GeForce RTX 3090.
  • Figure 4: Learning curves of DMG on Gym locomotion tasks during offline training. The curves are averaged over 5 random seeds, with the shaded area representing the standard deviation across seeds.
  • Figure 5: Learning curves of DMG on AntMaze tasks during offline training. The curves are averaged over 5 random seeds, with the shaded area representing the standard deviation across seeds.
  • ...and 2 more figures
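
The trade-off described in the Figure 2 caption (a smaller penalty coefficient $\nu$ allows broader action generalization) can be illustrated with a toy action-selection rule. This is a hedged sketch only: the squared-distance penalty, the helper names `penalized_objective` and `select_action`, and the toy Q-function are all assumptions for illustration, not the paper's stated objective.

```python
def penalized_objective(q_value, policy_action, dataset_action, nu):
    """Q-value minus nu times the squared distance to the dataset action (assumed form)."""
    dist_sq = sum((p - d) ** 2 for p, d in zip(policy_action, dataset_action))
    return q_value - nu * dist_sq


def select_action(candidates, dataset_action, nu, q_fn):
    """Pick the candidate action with the highest penalized objective."""
    return max(
        candidates,
        key=lambda a: penalized_objective(q_fn(a), a, dataset_action, nu),
    )


# Toy setup (hypothetical): a 1-D action space where the learned Q-function
# keeps increasing as the action moves away from the dataset action 0.0.
def toy_q(action):
    return action[0]


candidates = [(0.0,), (0.5,), (1.0,), (2.0,)]
dataset_action = (0.0,)

for nu in (4.0, 1.0, 0.25):
    print(nu, select_action(candidates, dataset_action, nu, toy_q))
```

In this toy setting, a large penalty keeps the selected action at the dataset action, while a small penalty lets the selection drift toward actions with higher, and possibly overestimated, Q-values.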

Theorems & Definitions (38)

  • Theorem 1: Informal
  • Definition 1: Mildly generalized policy
  • Definition 2
  • Definition 3
  • Lemma 1
  • Theorem 2: Contraction
  • Theorem 3: Performance
  • Theorem 4: Limited overestimation
  • Theorem 5: Performance lower bound
  • Lemma 2
  • ...and 28 more