General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies

Jianxun Wang; Grant C. Forbes; Leonardo Villalobos-Arias; David L. Roberts

TL;DR

The paper tackles offline RL when training data are limited in stochasticity and comprise mixtures of behavior policies. It develops a general Linear Programming formulation that links $f$-divergence to the Bellman residual, and introduces a flexible two-branch $f$-divergence $g^*_{\alpha_-,\alpha_+,\beta}(\zeta)$ to adapt dataset-constrained learning. The authors unify primal and dual RL objectives under constrained LP, provide heuristic methods to estimate the divergence parameters, and demonstrate that Flex-$f$-Q and Flex-$f$-DICE achieve competitive or superior performance on MuJoCo, Fetch, and AdroitHand datasets compared to baseline offline RL methods. This dataset-aware regularization can improve learning from challenging offline datasets and points toward automated adaptation of divergence constraints during training.

Abstract

Offline RL algorithms aim to improve upon the behavior policy that produces the collected data while constraining the learned policy to be within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and come from multiple behavior policies with diverse expertise levels. Limited exploration can impair an offline RL algorithm's ability to estimate $Q$ or $V$ values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between the $f$-divergence and the optimization constraint on the Bellman residual through a more general Linear Programming form for RL and the convex conjugate. Following this, we introduce the general flexible function formulation for the $f$-divergence to incorporate an adaptive constraint on algorithms' learning objectives based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible $f$-divergence in improving performance for learning from a challenging dataset when applied to a compatible constrained optimization algorithm.

Paper Structure

This paper contains 25 sections, 1 theorem, 28 equations, 3 figures, 5 tables, and 3 algorithms.

Introduction
Background
The General LP formulation for RL
Alternative form of $L_P$
Connecting Bellman minimization to LP
Unified form of LP for RL
Flexible $f$-divergence
Functional Form for Flexible $f$-divergence
Heuristic estimation of $\alpha_\pm$ and $\beta$
Effect of base function, $\alpha_\pm$ and $\beta$
Related works
Experiments and Analysis
Conclusion
RL Algorithms under General Constrained LP
Equivalence of Existing RL algorithm in the General constrained LP formulation
...and 10 more sections

Key Result

Theorem 1

For a convex $g(\cdot)$ with $g^*(1)=0$ and $g^{*'}(1)=0$, $-x+g(x)$ is convex and $\min_x \left[g(x)-x\right]=0$, attained at $x=0$.
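
Theorem 1 can be sanity-checked numerically on one concrete instance. The sketch below reads $g^*$ as the convex conjugate of $g$ (an assumption based on the abstract's mention of the convex conjugate) and picks $g^*(y)=\tfrac{1}{2}(y-1)^2$ purely for illustration; this choice, the grid, and all function names are hypothetical and not taken from the paper. For this choice the conjugate has the closed form $g(x)=\tfrac{1}{2}x^2+x$, so $g(x)-x=\tfrac{1}{2}x^2$.

```python
def g_star(y):
    """Illustrative conjugate (assumed, not from the paper): g*(1) = 0 and g*'(1) = 0."""
    return 0.5 * (y - 1.0) ** 2


def g(x):
    """Closed-form conjugate of g_star: sup_y [x*y - g_star(y)] = x**2 / 2 + x."""
    return 0.5 * x ** 2 + x


def conjugate_on_grid(f, x, grid):
    """Brute-force convex conjugate sup_y [x*y - f(y)] restricted to a finite grid."""
    return max(x * y - f(y) for y in grid)


# Grid on [-5, 5] with step 0.001; wide enough to contain the maximizers used below.
grid = [i / 1000.0 for i in range(-5000, 5001)]

# The closed form should agree with the brute-force conjugate at a few points.
for x in (-1.0, 0.0, 0.5, 2.0):
    assert abs(conjugate_on_grid(g_star, x, grid) - g(x)) < 1e-6

# Theorem 1 for this instance: g(x) - x has minimum value 0, attained at x = 0.
min_val, argmin = min((g(x) - x, x) for x in grid)
print(min_val, argmin)
```

This checks a single example only; it does not replace the general argument, which follows from $g^*(1)=\sup_x\,[x-g(x)]$ and the maximizer being $g^{*'}(1)$.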

Figures (3)

Figure 1: Dataset measurements across different compositions and the algorithms' performance on them. (Top) Positive Scaled Variance (PSV) of the return, Normalized Expected Return (NER), and SACo as a measure of exploration; (Bottom) Algorithm performance across different dataset mixtures and levels of behavior policy stochasticity.
Figure 2: (Top) Example $f$-divergence functions. The function for IQL is $\chi^2$ with $\alpha_-=\frac{10}{3}$ and $\alpha_+=\frac{10}{7}$, corresponding to $70\%$-expectile regression. Le-Cam+$\chi^2$ shares the same coefficients. (Bottom) Illustration of the effect of $\alpha_-$, $\alpha_+$, and $\beta$. Default values are $1.0$. All functions use $\chi^2$ as $g^*_+(\cdot)$.
Figure 3: $\alpha_\pm$ and $\beta$ changes throughout training.
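
The coefficients quoted in the Figure 2 caption can be related to expectile regression with simple arithmetic. The sketch below uses the standard $\tau$-expectile loss $|\tau-\mathbb{1}(u<0)|\,u^2$ and checks that the reciprocals $1/\alpha_+$ and $1/\alpha_-$ equal its two branch weights for $\tau=0.7$. How $\alpha_\pm$ actually enter $g^*_{\alpha_-,\alpha_+,\beta}$ is not stated in this summary, so the reciprocal reading is an assumption, and the function names are hypothetical.

```python
from fractions import Fraction


def expectile_weight(u, tau):
    """Weight |tau - 1(u < 0)| that the tau-expectile loss places on u**2."""
    return tau if u >= 0 else 1 - tau


def expectile_loss(u, tau):
    """Standard asymmetric squared loss used in expectile regression."""
    return expectile_weight(u, tau) * u * u


tau = Fraction(7, 10)          # 70% expectile, as in the caption
alpha_minus = Fraction(10, 3)  # value quoted in the caption
alpha_plus = Fraction(10, 7)   # value quoted in the caption

# Assumed reading: each alpha is the reciprocal of the matching branch weight.
assert 1 / alpha_plus == expectile_weight(Fraction(1), tau)    # 7/10 for u >= 0
assert 1 / alpha_minus == expectile_weight(Fraction(-1), tau)  # 3/10 for u < 0
```

Exact rational arithmetic via `Fraction` avoids floating-point rounding when comparing $1/\alpha_\pm$ with $0.7$ and $0.3$.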

Theorems & Definitions (2)

Theorem 1
Remark
