An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning

Haoran Xu, Shuozhe Li, Harshit Sikchi, Scott Niekum, Amy Zhang

TL;DR

Offline RL faces distributional shift between the policy visitation distribution $d^{\pi}$ and the offline data distribution $d^{\mathcal{D}}$; this work reframes the problem as optimal discriminator-weighted imitation and proposes Iterative Dual-RL (IDRL). IDRL learns the visitation ratio via a two-stage correction that first estimates the action-distribution ratio $w^{*}(a|s)$ and then recovers the true state-action ratio $w^{*}(s,a)$ through off-policy evaluation, followed by iterative self-distillation over dataset supports. The method yields monotonic improvements and achieves state-of-the-art performance on D4RL benchmarks and corrupted demonstrations, with better stability than existing Primal-RL and Dual-RL baselines. The work demonstrates that dataset refinement guided by the optimal discriminator weight can effectively bridge offline RL and imitation learning in a principled, in-sample way.

Abstract

We introduce Iterative Dual Reinforcement Learning (IDRL), a new method that takes an optimal discriminator-weighted imitation view of solving RL. Our method is motivated by a simple experiment in which we find that training a discriminator on the offline dataset plus an additional expert dataset, and then performing discriminator-weighted behavior cloning, gives strong results on various types of datasets. That optimal discriminator weight is quite similar to the learned visitation distribution ratio in Dual-RL; however, we find that current Dual-RL methods do not correctly estimate that ratio. In IDRL, we propose a correction method to iteratively approach the optimal visitation distribution ratio in the offline dataset given no additional expert dataset. During each iteration, IDRL removes zero-weight suboptimal transitions using the learned ratio from the previous iteration and runs Dual-RL on the remaining subdataset. This can be seen as replacing the behavior visitation distribution with the optimized visitation distribution from the previous iteration, which theoretically gives a curriculum of improved visitation distribution ratios that are closer to the optimal discriminator weight. We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability on all datasets.
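
The per-iteration filtering described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `iterative_filtering`, `estimate_ratio`, and `toy_ratio` are hypothetical names, and `estimate_ratio` stands in for a full Dual-RL solver (with the paper's correction step) that returns one non-negative weight per remaining transition.

```python
from typing import Callable

import numpy as np


def iterative_filtering(
    num_transitions: int,
    estimate_ratio: Callable[[np.ndarray], np.ndarray],
    num_iterations: int = 3,
) -> np.ndarray:
    """Return the indices of transitions that survive the filtering rounds.

    `estimate_ratio` is a hypothetical stand-in for a Dual-RL solver: given
    the indices of the current sub-dataset, it returns one weight per index.
    """
    kept = np.arange(num_transitions)
    for _ in range(num_iterations):
        weights = estimate_ratio(kept)
        survivors = kept[weights > 0.0]  # drop zero-weight transitions
        # Stop early if filtering would remove everything or remove nothing.
        if survivors.size == 0 or survivors.size == kept.size:
            break
        kept = survivors
    return kept


def toy_ratio(indices: np.ndarray) -> np.ndarray:
    """Toy weight (NOT Dual-RL): positive only for above-average rewards."""
    rewards = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    subset = rewards[indices]
    return np.maximum(subset - subset.mean(), 0.0)


kept_after_one = iterative_filtering(6, toy_ratio, num_iterations=1)
kept_after_three = iterative_filtering(6, toy_ratio, num_iterations=3)
```

In this toy setting each round re-estimates the weights on the surviving subset only, which mirrors the idea of replacing the behavior distribution with the previous iteration's refined distribution.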

Paper Structure

This paper contains 28 sections, 6 theorems, 42 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Semi-gradient Dual-RL only learns $w^{*}(a|s) = \frac{\pi^\ast(a|s)}{\mu(a|s)}$ instead of $w^{*}(s,a) = \frac{d^\ast(s,a)}{d^{\mathcal{D}}(s,a)}$.
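
To see how the two ratios in Proposition 1 relate, one can factor each visitation distribution into a state marginal and a policy, assuming $d^{\mathcal{D}}(s,a) = d^{\mathcal{D}}(s)\,\mu(a|s)$ and $d^\ast(s,a) = d^\ast(s)\,\pi^\ast(a|s)$. The state ratio $w^{*}(s)$ below is notation introduced here for illustration:

$$
w^{*}(s,a) = \frac{d^\ast(s,a)}{d^{\mathcal{D}}(s,a)} = \frac{d^\ast(s)}{d^{\mathcal{D}}(s)} \cdot \frac{\pi^\ast(a|s)}{\mu(a|s)} = w^{*}(s)\, w^{*}(a|s), \qquad w^{*}(s) := \frac{d^\ast(s)}{d^{\mathcal{D}}(s)}.
$$

Under this factorization, the action ratio $w^{*}(a|s)$ is missing exactly the state-level factor $w^{*}(s)$, so the two ratios coincide only where $d^\ast(s) = d^{\mathcal{D}}(s)$.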

Figures (6)

  • Figure 1: (a) Illustration of our proposed IDRL framework. IDRL breaks the regularization barrier by performing imitation learning on an iteratively refined dataset, so it can solve hard tasks that previous behavior-regularized offline RL methods cannot. For example, previous methods fail to find the shortest path from the blue point to the red point while crossing the yellow point, because data coverage is non-uniform across states. (b) Mean scores of optimal discriminator-weighted behavior cloning (Optimal-DWBC) on D4RL Mujoco-{m,m-r,m-e}, Antmaze-{all} and Kitchen-{all} datasets. We train a discriminator $d$ on the offline dataset $\mathcal{D}$ and an additional expert dataset $\mathcal{D}_E$, then use $w_d(s,a)=\frac{d^{E}(s,a)}{d^{\mathcal{D}}(s,a)} = \frac{d(s,a)}{1 - d(s,a)}$ to do weighted BC on the transitions in $\mathcal{D}$ with $w_d(s,a) > \delta$. We compare it with the SOTA Primal-RL method ReBRAC [tarasov2024revisiting] and the SOTA Dual-RL method ODICE [mao2024odice].
  • Figure 2: IDRL in a grid-world domain. The initial state (blue) and the goal state (red) define the task, with the green arrows representing remaining transitions from the dataset. The opacity of the green arrows denotes magnitude of the weights from the respective distribution ratios. Red arrows depict the trajectories generated by policies obtained through weighted-BC with the estimated distribution ratios. (a) Original dataset. (b) and (d) show the results after filtering the dataset based on the learned policy ratio (Uncorrected Dual-RL visitation ratio) $w(a|s)$ in the first and second iterations, respectively. (c) and (e) demonstrate the subsequent filtering using the state-action visitation distribution ratio $w(s, a)$, which is computed by combining $w(a|s)$ and $w(s)$ (IDRL correction). This process reveals that the method progressively focuses on the most relevant transitions, enabling the recovery of a near-optimal visitation distribution ratio after 2 iterations.
  • Figure 3: Performance of naive BC, Advantage-Weighted (AW), Density-Weighted initialized with AW (DW+AW) and IDRL on mixed datasets created by combining varying percentages of expert transitions with random transitions from D4RL Mujoco datasets. For IDRL, the results shown use three iterations.
  • Figure 4: Learning curves of IDRL on D4RL Mujoco locomotion datasets.
  • Figure 5: Learning curves of IDRL on D4RL Antmaze datasets.
  • ...and 1 more figure
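
The weighting rule in the Figure 1(b) caption, $w_d(s,a) = \frac{d(s,a)}{1 - d(s,a)}$ with threshold $\delta$, can be sketched as below. This is an illustrative sketch under stated assumptions: the discriminator output is taken to be the probability that a transition comes from the expert dataset, the function names are hypothetical, and normalizing the loss by the number of kept transitions is a choice made here rather than something the source specifies.

```python
import numpy as np


def discriminator_weights(d_probs, delta=0.0, eps=1e-6):
    """Turn discriminator probabilities d(s,a) into weights d / (1 - d).

    Returns the weights and a boolean mask of transitions with weight > delta.
    Probabilities are clipped to [eps, 1 - eps] to avoid division by zero.
    """
    d = np.clip(np.asarray(d_probs, dtype=float), eps, 1.0 - eps)
    weights = d / (1.0 - d)
    mask = weights > delta
    return weights, mask


def weighted_bc_loss(log_probs, weights, mask):
    """Weighted negative log-likelihood over the kept transitions.

    `log_probs` holds log pi(a|s) for each dataset transition. Dividing by the
    number of kept transitions is an assumption made for this sketch.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    kept = mask.astype(float)
    denom = max(kept.sum(), 1.0)
    return float(-(weights * kept * log_probs).sum() / denom)


weights, mask = discriminator_weights([0.5, 0.2, 0.8], delta=1.0)
loss = weighted_bc_loss([-1.0, -2.0, -3.0], weights, mask)
```

With $\delta = 1$, only transitions the discriminator considers more likely expert than not ($d > 0.5$) contribute to the behavior cloning loss.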

Theorems & Definitions (11)

  • Proposition 1
  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • proof
  • proof
  • proof
  • proof
  • ...and 1 more