DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li

TL;DR

DualDICE tackles offline reinforcement learning by estimating discounted stationary distribution corrections without knowledge of the data-collection policy. It derives a minimax objective via Fenchel duality and a change of variables, yielding corrections as Bellman residuals and enabling a practical, weight-free estimation framework. The approach comes with formal convergence guarantees and demonstrates improved off-policy evaluation accuracy, particularly in settings with function approximation or multiple unknown behavior policies. These properties make DualDICE well-suited for robust evaluation and training of policies from fixed datasets in diverse real-world domains.

Abstract

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.
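
The correction terms described above can be illustrated on a small tabular problem. The following is a minimal sketch, not the authors' implementation. It assumes the quadratic form of the objective, $\min_\nu \frac{1}{2}\mathbb{E}_{d^{\mathcal{D}}}[(\nu - \mathcal{B}^\pi\nu)(s,a)^2] - (1-\gamma)\,\mathbb{E}_{s_0,a_0}[\nu(s_0,a_0)]$, whose minimizer has Bellman residual equal to the ratio $d^\pi/d^{\mathcal{D}}$. The random MDP, the helper name `dualdice_tabular`, and all variable names are hypothetical, and exact distributions stand in for empirical samples.

```python
import numpy as np

def dualdice_tabular(d_data, P_pi, mu0_pi, gamma):
    """Exactly minimize the quadratic objective in the tabular case (illustrative only).

    d_data : (n,) dataset distribution over state-action pairs, strictly positive
    P_pi   : (n, n) P_pi[x, x'] = probability of moving from pair x to pair x'
             under the environment dynamics followed by the target policy
    mu0_pi : (n,) initial state-action distribution beta(s0) * pi(a0 | s0)
    """
    n = len(d_data)
    A = np.eye(n) - gamma * P_pi  # linear map nu -> nu - B^pi nu
    # First-order condition: A^T diag(d_data) A nu = (1 - gamma) * mu0_pi
    nu = np.linalg.solve(A.T @ np.diag(d_data) @ A, (1.0 - gamma) * mu0_pi)
    return nu, A @ nu  # nu and its Bellman residual (the ratio estimate)

# Hypothetical toy MDP: 4 states, 2 actions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
n = n_states * n_actions
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]
beta = rng.dirichlet(np.ones(n_states))                           # initial states
reward = rng.uniform(size=n)

# The dataset distribution is arbitrary: no behavior policy is ever specified.
d_data = rng.dirichlet(np.ones(n))

# Flatten (s, a) into a single index s * n_actions + a.
P_pi = (T[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
mu0_pi = (beta[:, None] * pi).reshape(-1)

nu, w = dualdice_tabular(d_data, P_pi, mu0_pi, gamma)

# Ground truth: d_pi = (1 - gamma) * (I - gamma * P_pi^T)^{-1} mu0_pi.
d_pi = (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0_pi)

rho_hat = np.sum(d_data * w * reward)  # OPE estimate using the recovered ratios
rho_true = np.sum(d_pi * reward)       # exact normalized discounted value
print(np.max(np.abs(w - d_pi / d_data)), abs(rho_hat - rho_true))
```

Because `d_data` is drawn at random rather than induced by any policy, the sketch also illustrates the behavior-agnostic property: only the dataset distribution, the transitions, and the target policy enter the computation.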

Paper Structure

This paper contains 36 sections, 6 theorems, 75 equations, 5 figures, 1 algorithm.

Key Result

Theorem 2

(Informal) Under some mild assumptions, the mean squared error (MSE) associated with using $\hat{\nu},\hat{\zeta}$ for OPE can be bounded as, where the outer expectation is with respect to the randomness of the empirical samples and $OPT$, $\epsilon_{opt}$ denotes the optimization error, and ${\epsilon}_{approx}\left(\mathcal{F}, \mathcal{H}\right)$ denotes the approximation error due to $\mathca

Figures (5)

  • Figure 1: We perform OPE on the Taxi domain (Dietterich, 2000). The plots show log RMSE of the estimator across different numbers of trajectories and different trajectory lengths ($x$-axis). For this domain, we avoid any potential issues in optimization by solving for the optimum of the objectives exactly using standard matrix operations. Thus, we are able to see that our method and the TD method are competitive with each other.
  • Figure 2: We perform OPE on control tasks. Each plot shows the estimated average step reward over training (x-axis is training step) and different behavior policies (higher $\alpha$ corresponds to a behavior policy closer to the target policy). We find that in all cases, our method is able to approximate these desired values well, with accuracy improving with a larger $\alpha$. On the other hand, the TD method performs poorly, even more so when the behavior policy $\mu$ is unknown and must be estimated. While on Cartpole it can start to approach the desired value for large $\alpha$, on the more complicated Reacher task (which involves continuous actions) its learning is too unstable to learn anything at all.
  • Figure 3: We compare the OPE error when using different forms of $f$ to estimate stationary distribution ratios with function approximation, which are then applied to OPE on a simple continuous grid task. In this setting, optimization stability is crucial, and this heavily depends on the form of the convex function $f$. We plot the results of using $f(x)=\frac{1}{p}|x|^p$ for $p\in\{1.25, 1.5, 2, 3, 4\}$. We also show the results of TD and IS methods on this task for comparison. We find that $p=1.5$ consistently performs the best, often providing significantly better results.
  • Figure 4: We perform OPE on control tasks (x-axis is training step) using our method compared to a number of additional baselines: doubly-robust (DR), in which one learns a value function in order to reduce the variance of an IS estimate of the evaluation; direct method (DM), in which one learns a model of the dynamics and reward of the environment and performs Monte Carlo rollouts using the model in order to estimate the value of the target policy; and $Q^\pi$, in which one learns $Q^\pi$ values via Bellman error minimization over the off-policy data, and uses the initial values $(1-\gamma)\cdot Q^\pi(s_0,a_0)$ as estimates of the policy value (these estimates are below $-0.4$ for Reacher, $\alpha=0$).
  • Figure 5: We perform OPE on additional control tasks (Acrobot and Pendulum) using our method compared to a number of baselines. We find that our method continues to perform well against previous OPE methods. Similar to the earlier control-task results (Cartpole and Reacher), we find that the baselines can perform reasonably well on discrete control (Acrobot) but performance degrades in a continuous control setting (Pendulum).
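
The family $f(x)=\frac{1}{p}|x|^p$ from Figure 3 has the closed-form convex conjugate $f^*(y)=\frac{1}{q}|y|^q$ with $\frac{1}{p}+\frac{1}{q}=1$, which is what allows Fenchel duality to rewrite $f(x)$ as $\max_y\, xy - f^*(y)$. The sketch below checks this identity numerically for the values of $p$ listed in the caption. It is an illustration under stated assumptions, not code from the paper: the function names, the test points, and the grid bounds are all hypothetical choices.

```python
import numpy as np

def f(x, p):
    """Convex function f(x) = |x|^p / p."""
    return np.abs(x) ** p / p

def f_conj(y, p):
    """Convex conjugate f*(y) = |y|^q / q, where 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return np.abs(y) ** q / q

def dual_argmax(x, p):
    """Maximizer of x * y - f*(y) over y, namely f'(x) = sign(x) * |x|^(p - 1)."""
    return np.sign(x) * np.abs(x) ** (p - 1.0)

def fenchel_gaps(p, xs, ys):
    """Compare f(x) with its dual form, in closed form and by brute-force grid search."""
    y_star = dual_argmax(xs, p)
    closed_form = xs * y_star - f_conj(y_star, p)
    grid_search = np.max(xs[:, None] * ys[None, :] - f_conj(ys[None, :], p), axis=1)
    target = f(xs, p)
    return np.max(np.abs(closed_form - target)), np.max(np.abs(grid_search - target))

xs = np.linspace(-3.0, 3.0, 13)       # hypothetical test points
ys = np.linspace(-30.0, 30.0, 60001)  # grid wide enough to contain every maximizer

gaps = {p: fenchel_gaps(p, xs, ys) for p in (1.25, 1.5, 2.0, 3.0, 4.0)}
for p, (closed_gap, grid_gap) in gaps.items():
    print(f"p={p}: closed-form gap={closed_gap:.2e}, grid gap={grid_gap:.2e}")
```

The maximizing dual variable is $f'(x)$, so for $p=2$ it coincides with $x$ itself. For other values of $p$ the dual variable is a nonlinear transform of the residual, which is one plausible reason the choice of $f$ affects optimization stability as the caption reports.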

Theorems & Definitions (9)

  • Theorem 2
  • Lemma 4
  • Lemma 5
  • Lemma 6: Statistical error ${\epsilon}_{est}\left(\mathcal{F}\right)$
  • proof
  • Lemma 7: Statistical error ${\epsilon}_{stat}$
  • proof
  • proof
  • Corollary 10