Symmetric Behavior Regularized Policy Optimization

Lingwei Zhu, Haseeb Shah, Zheng Chen, Yukie Nagai, Martha White

TL;DR

This work investigates symmetric regularization in Behavior Regularized Policy Optimization (BRPO) for offline RL and proves that symmetric divergences generally do not admit an analytic optimal policy $π^*$. To make symmetric BRPO practical, it develops a finite Taylor expansion of $f$-divergences into Pearson-Vajda $χ^n$ terms and shows that analytic solutions exist when the series is truncated at $N<5$, with a notably simple form at $N=2$. Building on this, the authors introduce Symmetric $f$-divergence Actor-Critic (S$f$-AC), which optimizes a loss combining a KL-based advantage regression term with a truncated conditional-symmetry term and uses clipping for numerical stability. Empirically, S$f$-AC performs strongly on distribution-matching tasks and the D4RL MuJoCo offline suite, shows fewer environment-specific failures than the baselines AWAC, XQL, SQL, and IQL, and is robust to the number of symmetry terms and to the clipping threshold. The results suggest that symmetric BRPO is a viable, stable regularization option for offline RL when paired with a controlled Taylor-series approximation, although the truncation and clipping hyperparameters remain important.

Abstract

Behavior Regularized Policy Optimization (BRPO) leverages asymmetric (divergence) regularization to mitigate distribution shift in offline Reinforcement Learning. This paper is the first to study the open question of symmetric regularization. We show that symmetric regularization does not permit an analytic optimal policy $π^*$, posing a challenge to the practical utility of symmetric BRPO. We approximate $π^*$ by the Taylor series of Pearson-Vajda $χ^n$ divergences and show that an analytic policy expression exists only when the series is capped at $n=5$. To compute the solution in a numerically stable manner, we propose to Taylor expand the conditional symmetry term of the symmetric divergence loss, leading to a novel algorithm: Symmetric $f$-Actor Critic (S$f$-AC). S$f$-AC achieves consistently strong results across various D4RL MuJoCo tasks. Additionally, S$f$-AC avoids the per-environment failures observed in IQL, SQL, XQL and AWAC, opening up possibilities for more diverse and effective regularization choices for offline RL.
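The Taylor-series idea mentioned in the abstract can be illustrated on a small discrete example. The sketch below is not the paper's implementation; it assumes the common convention $D_f(p\|q)=\sum_x q(x)\,f(p(x)/q(x))$ and $χ^n(p\|q)=\sum_x (p(x)-q(x))^n / q(x)^{n-1}$, and uses the KL generator $f(t)=t\ln t$, whose expansion around $t=1$ has coefficients $f^{(n)}(1)/n! = (-1)^n/(n(n-1))$ for $n\ge 2$. The function names are hypothetical, and the paper's own definitions may differ.

```python
import math

def chi_n(p, q, n):
    """Pearson-Vajda chi^n(p||q) = sum_x (p(x) - q(x))^n / q(x)^(n-1) (assumed convention)."""
    return sum((pi - qi) ** n / qi ** (n - 1) for pi, qi in zip(p, q))

def kl_exact(p, q):
    """KL(p||q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_taylor(p, q, N):
    """Truncated series sum_{n=2}^{N} (-1)^n / (n (n-1)) * chi^n(p||q).

    The n=1 term vanishes because chi^1 = sum_x (p(x) - q(x)) = 0.
    The series is only expected to converge when p/q stays close to 1.
    """
    return sum((-1) ** n / (n * (n - 1)) * chi_n(p, q, n) for n in range(2, N + 1))

# Two nearby distributions, so the ratio p/q stays within 0.2 of 1.
p = [0.3, 0.4, 0.3]
q = [0.25, 0.5, 0.25]
for N in (2, 4, 10):
    print(N, kl_taylor(p, q, N), kl_exact(p, q))
```

With these inputs the $N=2$ truncation is half the Pearson $χ^2$ divergence, and higher truncation orders approach the exact KL value. When the ratio $p/q$ moves far from 1 the series can diverge, which is why the summary pairs the truncated expansion with clipping.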

Paper Structure

This paper contains 27 sections, 6 theorems, 33 equations, 11 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let the symmetric $f$-divergence be defined by Definition \ref{def:wider_symm}. Then if $g'(t)$ does not make $f'(t)$ an affine function in $\ln t$, i.e., $f'(t) \neq a\ln t + b$, the regularized optimal policy $\pi^*$ does not have an analytic expression.
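The role of the affine-in-$\ln t$ condition can be seen from a standard stationarity argument. The sketch below is an illustration rather than the paper's proof. It assumes a per-state objective with temperature $\tau$, behavior policy $\mu$, and a Lagrange multiplier $\lambda$ for normalization (none of these symbols come from the theorem statement), writes actions as $u$ to avoid a clash with the coefficient $a$, and ignores the nonnegativity constraint on $\pi$.

```latex
\begin{aligned}
&\max_{\pi}\; \sum_{u} \pi(u|s)\,Q(s,u) \;-\; \tau \sum_{u} \mu(u|s)\,
   f\!\left(\frac{\pi(u|s)}{\mu(u|s)}\right)
   \quad \text{s.t.}\ \sum_{u}\pi(u|s)=1, \\[4pt]
&\text{stationarity in } \pi(u|s):\quad
   Q(s,u) - \tau\, f'\!\left(\frac{\pi(u|s)}{\mu(u|s)}\right) - \lambda = 0, \\[4pt]
&\text{if } f'(t) = a\ln t + b:\quad
   \pi^*(u|s) = \mu(u|s)\exp\!\left(\frac{Q(s,u) - \lambda - \tau b}{\tau a}\right)
   \;\propto\; \mu(u|s)\exp\!\left(\frac{Q(s,u)}{\tau a}\right).
\end{aligned}
```

The KL generator $f(t)=t\ln t$ has $f'(t)=\ln t+1$, so $a=b=1$ and the familiar exponentially reweighted behavior policy is recovered. By contrast, under the same convention the Jeffreys generator $f(t)=(t-1)\ln t$ has $f'(t)=\ln t+1-1/t$, which is not affine in $\ln t$.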

Figures (11)

  • Figure 1: Number of times each algorithm is amongst the top-3 performers on 9 D4RL MuJoCo tasks. Our method (S$f$-AC) remains relatively more stable across tasks compared to the baselines.
  • Figure 2: Approximating a mixture of Gaussians (black) by minimizing the vanilla divergence (solid) and the S$f$-AC loss for $N_\text{loss}\!=\!5$ (dashed). The vanilla JS loss causes the Gaussian to lose track of the optimal $\sigma^*$ given by numerical integration.
  • Figure 3: S$f$-AC with Jensen-Shannon and Jeffreys divergences ($N_\text{loss}=3, \epsilon=100$) versus baselines on the D4RL MuJoCo environments. Solid lines show the mean and shaded regions the standard deviation over 5 seeds. Only S$f$-AC methods are shown with full opacity. Both the Jensen-Shannon and Jeffreys divergences performed favorably compared to the baselines.
  • Figure 4: Policy evolution of S$f$-AC versus AWAC for the first $20\%$ of learning. Minimizing the forward KL in AWAC increasingly pushes the policy beyond the minimum allowed action $-1$ (shaded area).
  • Figure 5: Ablation studies on HalfCheetah. (A) Scores of S$f$-AC across $N_\text{loss}$ when $\epsilon=100$. Scattered dots are evaluations from the last $20\%$ of learning. JS remains stable for large $N_\text{loss}$ because its series coefficients decay quickly to zero; see Table \ref{table:approx_table}. By contrast, the performance of Jeffreys decreases as $N_\text{loss}$ increases. (B) Scores of JS under various clipping thresholds $\epsilon$ when $N_\text{loss}=3$. The dashed vertical line shows the median performance of $\epsilon=0.2$. Overall, S$f$-AC is insensitive to $\epsilon$.
  • ...and 6 more figures
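The captions above compare two symmetric divergences, Jensen-Shannon and Jeffreys. As a reminder of what "symmetric" means here, the following minimal sketch computes both for discrete distributions using their textbook definitions. The helper names and example distributions are illustrative and do not come from the paper.

```python
import math

def kl(p, q):
    """KL(p||q); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(p, q):
    """Jeffreys divergence: KL(p||q) + KL(q||p). Requires shared support."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print("KL(p||q), KL(q||p):", kl(p, q), kl(q, p))        # differ: KL is asymmetric
print("Jeffreys:", jeffreys(p, q), jeffreys(q, p))      # equal under argument swap
print("JS:", jensen_shannon(p, q), jensen_shannon(q, p))  # equal, bounded by ln 2
```

Swapping the arguments changes the KL value but leaves both symmetric divergences unchanged. Jensen-Shannon is additionally bounded above by $\ln 2$, whereas Jeffreys is unbounded.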

Theorems & Definitions (13)

  • Definition 1
  • Definition 2: Pardo2006-divergences
  • Definition 3: Sason2016-fDiverInequalities
  • Theorem 1
  • proof
  • Corollary 1
  • Theorem 2
  • proof
  • Lemma 3: Nielsen2013-chiApproxFdiv
  • Theorem 3
  • ...and 3 more