ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update

Liyuan Mao; Haoran Xu; Weinan Zhang; Xianyuan Zhan

TL;DR

The paper reexamines Distribution Correction Estimation (DICE) methods for offline RL and IL and identifies a conflict between two gradient terms that arise when the value function is learned with the true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state). It proposes an orthogonal-gradient update that projects the backward gradient onto the normal plane of the forward gradient, yielding the O-DICE algorithm, which augments V-DICE with this orthogonalized update. Theoretical results show interference-free optimization, guaranteed descent under suitable conditions, and reduced feature co-adaptation. Toy and large-scale experiments demonstrate the benefits of a state-action-level constraint, improved robustness to OOD states, and SOTA performance on D4RL and offline IL tasks. The work suggests that correcting the DICE objective with gradient projection yields practical gains, with future directions including extensions to online RL and to other Bellman-style objectives.

Abstract

In this study, we investigate the DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that solely use action-level behavior constraint. After revisiting DICE-based methods, we find there exist two gradient terms when learning the value function using true-gradient update: forward gradient (taken on the current state) and backward gradient (taken on the next state). Using forward gradient bears a large similarity to many offline RL methods, and thus can be regarded as applying action-level constraint. However, directly adding the backward gradient may degenerate or cancel out its effect if these two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in an orthogonal-gradient update, a new learning rule for DICE-based methods. We conduct thorough theoretical analyses and find that the projected backward gradient brings state-level behavior regularization, which reveals the mystery of DICE-based methods: the value learning objective does try to impose state-action-level constraint, but needs to be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and great robustness.
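The projection described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `orthogonal_gradient`, the weight `eta` on the projected term, the stabilizer `eps`, and the two-dimensional example vectors are all assumptions made for the example.

```python
import numpy as np

def orthogonal_gradient(g_forward, g_backward, eta=1.0, eps=1e-12):
    """Project g_backward onto the normal plane of g_forward, then combine.

    eta (weight of the projected term) and eps (numerical stabilizer)
    are illustrative assumptions, not values taken from the paper.
    """
    g_forward = np.asarray(g_forward, dtype=float)
    g_backward = np.asarray(g_backward, dtype=float)
    # Component of the backward gradient along the forward gradient.
    coef = (g_backward @ g_forward) / (g_forward @ g_forward + eps)
    # Remove that component so the remainder is orthogonal to g_forward.
    g_backward_perp = g_backward - coef * g_forward
    return g_forward + eta * g_backward_perp, g_backward_perp

# Hypothetical gradients whose directions conflict (negative inner product).
g_f = np.array([1.0, 0.0])
g_b = np.array([-2.0, 1.0])

naive_update = g_f + g_b                      # backward term overrides the forward one
ortho_update, g_b_perp = orthogonal_gradient(g_f, g_b)

print("naive . g_f =", naive_update @ g_f)    # negative: forward direction is reversed
print("ortho . g_f =", ortho_update @ g_f)    # forward component is preserved
```

In this toy case the naive sum points against the forward gradient, while the projected update keeps the forward component intact and only adds the part of the backward gradient that does not interfere with it.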

Paper Structure (20 sections, 4 theorems, 58 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Mystery of Distribution Correction Estimation (DICE)
Preliminaries
Decompose Gradient Flow of DICE
Fix The Gap by Orthogonal-gradient Update
Analysis
Theoretic Results
A Toy Example
Experiments
Results on D4RL Benchmarks
Results on Policy Robustness
Results on Offline Imitation Learning
Related Work
Conclusions, Limitations and Future Work
A Recap of Distribution Correction Estimation
...and 5 more sections

Key Result

Theorem 1

In orthogonal-gradient update, $L_1^{\theta^{\prime\prime}}(s) - L_1^{\theta^{\prime}}(s) = 0$ (under first order approximation).
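
One way to read this statement is sketched below under assumed notation. The symbols $g_f$, $g_b$, $g_b^{\perp}$, $\alpha$, and $\eta$ are illustrative and may differ from the paper's. The sketch supposes that $L_1$ is the forward (current-state) loss term, that $\theta'$ is obtained from $\theta$ by a forward-gradient step, and that $\theta''$ is obtained from $\theta'$ by a step along the projected backward gradient.

$$
\begin{aligned}
\theta' &= \theta - \alpha\, g_f, \qquad g_f = \nabla_\theta L_1^{\theta}(s),\\
\theta'' &= \theta' - \alpha \eta\, g_b^{\perp}, \qquad g_b^{\perp} = g_b - \frac{g_b^\top g_f}{\|g_f\|^2}\, g_f,\\
L_1^{\theta''}(s) - L_1^{\theta'}(s) &\approx \nabla_\theta L_1^{\theta'}(s)^\top (\theta'' - \theta') \approx -\alpha \eta\, g_f^\top g_b^{\perp} = 0.
\end{aligned}
$$

The second approximation replaces the gradient at $\theta'$ with $g_f$, which drops terms of higher order in $\alpha$. The final equality holds because $g_b^{\perp}$ is orthogonal to $g_f$ by construction.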

Figures (5)

Figure 1: Illustration of orthogonal-gradient update
Figure 2: Visualizations of the value function $V$ learned with different update rules in a grid-world toy case, with V-DICE as the base algorithm. We normalize the value of $V$ to $[0, 100]$ for fair comparison. The borderline of the offline dataset support is marked by white dashed lines. Orthogonal-gradient update better distinguishes both different in-distribution actions and in-distribution states vs. OOD states.
Figure 3: Left: percent difference of the worst episode during the 10 evaluation episodes at the last evaluation of different offline RL algorithms. Right: mean value of $\nabla_{\theta}V(s)^\top \nabla_{\theta}V(s^{\prime})$ over the dataset of S-DICE and O-DICE.
Figure 4: Learning curves of O-DICE and S-DICE on D4RL MuJoCo locomotion datasets.
Figure 5: Learning curves of O-DICE and S-DICE on D4RL AntMaze datasets.

Theorems & Definitions (10)

Theorem 1
Theorem 2
Theorem 3: Orthogonal-gradient update helps alleviate feature co-adaptation
Theorem 4: How feature co-adaptation affects state-level robustness
proof
proof
proof
proof
proof
proof
