ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update

Liyuan Mao, Haoran Xu, Weinan Zhang, Xianyuan Zhan

TL;DR

The paper re-examines Distribution Correction Estimation (DICE) methods for offline RL and IL and identifies a conflict between the two gradient terms that arise when the value function is learned with a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state). It proposes an orthogonal-gradient update that projects the backward gradient onto the normal plane of the forward gradient, yielding the O-DICE algorithm, which augments V-DICE with this orthogonalized update. Theoretical results show interference-free optimization, guaranteed descent under suitable conditions, and reduced feature co-adaptation. Toy and large-scale experiments demonstrate the benefits of a state-action-level constraint, improved robustness to out-of-distribution (OOD) states, and SOTA performance on D4RL and offline IL tasks. The work suggests that correcting the DICE objective with gradient projection yields practical gains; suggested future directions include online RL extensions and adaptations to other Bellman-style objectives.

Abstract

In this study, we investigate the DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that solely use action-level behavior constraint. After revisiting DICE-based methods, we find there exist two gradient terms when learning the value function using true-gradient update: forward gradient (taken on the current state) and backward gradient (taken on the next state). Using forward gradient bears a large similarity to many offline RL methods, and thus can be regarded as applying action-level constraint. However, directly adding the backward gradient may degenerate or cancel out its effect if these two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in an orthogonal-gradient update, a new learning rule for DICE-based methods. We conduct thorough theoretical analyses and find that the projected backward gradient brings state-level behavior regularization, which reveals the mystery of DICE-based methods: the value learning objective does try to impose state-action-level constraint, but needs to be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and great robustness.
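The projection step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes both gradients have already been flattened into 1-D vectors, and the function name `orthogonal_gradient`, the weight `eta` on the projected backward gradient, and the stabilizer `eps` are hypothetical choices made for this example.

```python
import numpy as np

def orthogonal_gradient(g_forward, g_backward, eta=1.0, eps=1e-12):
    """Combine forward and backward gradients without letting them cancel.

    The backward gradient is projected onto the plane normal to the forward
    gradient, so the returned update keeps the full forward component.
    `eta` and `eps` are illustrative assumptions, not values from the paper.
    """
    g_f = np.asarray(g_forward, dtype=float)
    g_b = np.asarray(g_backward, dtype=float)
    # Scalar coefficient of g_b along g_f.
    coef = (g_b @ g_f) / (g_f @ g_f + eps)
    # Remove that component; what remains is orthogonal to g_f.
    g_b_perp = g_b - coef * g_f
    return g_f + eta * g_b_perp, g_b_perp

# A backward gradient that partially opposes the forward gradient.
g_f = np.array([1.0, 0.0])
g_b = np.array([-0.8, 0.5])

naive = g_f + g_b                          # forward component shrinks
update, g_b_perp = orthogonal_gradient(g_f, g_b)
print("naive sum:        ", naive)
print("orthogonal update:", update)
print("g_f . g_b_perp:   ", g_f @ g_b_perp)
```

In the naive sum the conflicting part of the backward gradient erodes the forward direction, whereas the projected version only contributes along directions the forward gradient does not use.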

Paper Structure

This paper contains 20 sections, 4 theorems, 58 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

In the orthogonal-gradient update, $L_1^{\theta^{\prime\prime}}(s) - L_1^{\theta^{\prime}}(s) = 0$ (under a first-order approximation).
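A short first-order expansion shows why a result of this form holds. The notation here is an assumption, since this summary does not define it: $g_f = \nabla_\theta L_1^{\theta}(s)$ is taken to be the forward gradient, $g_b$ the backward gradient, $\theta^{\prime}$ the parameters after the forward step, $\theta^{\prime\prime}$ the parameters after the additional projected backward step, $\alpha$ a learning rate, and $\eta$ a weight on the projected term. The paper's exact definitions may differ.

```latex
\begin{aligned}
g_b^{\perp} &= g_b - \frac{g_b^\top g_f}{\lVert g_f \rVert^2}\, g_f, \qquad
\theta^{\prime} = \theta - \alpha\, g_f, \qquad
\theta^{\prime\prime} = \theta^{\prime} - \alpha\, \eta\, g_b^{\perp}, \\
L_1^{\theta^{\prime\prime}}(s) - L_1^{\theta^{\prime}}(s)
&\approx -\alpha\, \eta\, \big(\nabla_\theta L_1^{\theta^{\prime}}(s)\big)^\top g_b^{\perp}
\approx -\alpha\, \eta\, g_f^\top g_b^{\perp} \\
&= -\alpha\, \eta \left( g_f^\top g_b - \frac{g_b^\top g_f}{\lVert g_f \rVert^2}\, g_f^\top g_f \right) = 0 .
\end{aligned}
```

The second approximation replaces the gradient at $\theta^{\prime}$ with the gradient at $\theta$; the two differ by a term of order $\alpha$, so the neglected contribution is of order $\alpha^2$.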

Figures (5)

  • Figure 1: Illustration of orthogonal-gradient update
  • Figure 2: Visualizations of the value function $V$ learned with different update rules in a grid-world toy case; the base algorithm is V-DICE. The value of $V$ is normalized to $[0, 100]$ for fair comparison. The border of the offline dataset support is marked by white dashed lines. The orthogonal-gradient update better distinguishes both between different in-distribution actions and between in-distribution and OOD states.
  • Figure 3: Left: percent difference of the worst episode during the 10 evaluation episodes at the last evaluation of different offline RL algorithms. Right: mean value of $\nabla_{\theta}V(s)^\top \nabla_{\theta}V(s^{\prime})$ over the dataset of S-DICE and O-DICE.
  • Figure 4: Learning curves of O-DICE and S-DICE on D4RL MuJoCo locomotion datasets.
  • Figure 5: Learning curves of O-DICE and S-DICE on D4RL AntMaze datasets.
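The quantity reported in the right panel of Figure 3, the mean of $\nabla_{\theta}V(s)^\top \nabla_{\theta}V(s^{\prime})$ over the dataset, can be computed as sketched below. The one-hidden-layer tanh network, its sizes, and the function names are hypothetical choices for illustration only; they are not the architecture or code used in the paper.

```python
import numpy as np

def value(params, s):
    """Toy value network: V(s) = w2 . tanh(W1 s + b1) + b2."""
    W1, b1, w2, b2 = params
    return float(w2 @ np.tanh(W1 @ s + b1) + b2)

def value_grad(params, s):
    """Analytic gradient of V(s) w.r.t. all parameters, flattened to 1-D."""
    W1, b1, w2, b2 = params
    h = np.tanh(W1 @ s + b1)
    dh = (1.0 - h ** 2) * w2            # dV / d(pre-activation)
    # Order: dV/dW1 (row-major), dV/db1, dV/dw2, dV/db2.
    return np.concatenate([np.outer(dh, s).ravel(), dh, h, [1.0]])

def mean_grad_similarity(params, states, next_states):
    """Mean of grad V(s)^T grad V(s') over paired (s, s') transitions."""
    dots = [value_grad(params, s) @ value_grad(params, sn)
            for s, sn in zip(states, next_states)]
    return float(np.mean(dots))

# Usage on synthetic transitions (random data, for illustration only).
rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 3)), rng.normal(size=8),
          rng.normal(size=8), 0.0)
states = rng.normal(size=(16, 3))
next_states = states + 0.1 * rng.normal(size=(16, 3))
print(mean_grad_similarity(params, states, next_states))
```

A large positive value means that the value gradients at $s$ and $s^{\prime}$ point in similar directions, which is the feature co-adaptation that Theorems 3 and 4 refer to.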

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Theorem 3: Orthogonal-gradient update helps alleviate feature co-adaptation
  • Theorem 4: How feature co-adaptation affects state-level robustness
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof