Adaptive Horizon Actor-Critic for Policy Learning in Contact-Rich Differentiable Simulation

Ignat Georgiev; Krishnan Srinivasan; Jie Xu; Eric Heiden; Animesh Garg

TL;DR

This work tackles the gradient-variance bottleneck in continuous control by contrasting zeroth-order model-free RL (MFRL) methods with first-order model-based RL (FO-MBRL) approaches in differentiable simulators. It introduces Adaptive Horizon Actor-Critic (AHAC), which adaptively truncates model-based rollouts at contact to avoid stiff-gradient errors, guided by a horizon constraint set by a threshold $C$. Through a dual formulation and a double-critic architecture, AHAC achieves superior asymptotic rewards across multiple locomotion tasks and scales to high-dimensional control (up to $152$ actions), outperforming strong MFRL baselines. The results demonstrate the viability and benefits of horizon adaptation in FO-MBRL within differentiable simulators, and point to further improvements via simulator fidelity and parallel training efficiency.

Abstract

Model-Free Reinforcement Learning (MFRL), leveraging the policy gradient theorem, has demonstrated considerable success in continuous control tasks. However, these approaches are plagued by high gradient variance due to zeroth-order gradient estimation, resulting in suboptimal policies. Conversely, First-Order Model-Based Reinforcement Learning (FO-MBRL) methods employing differentiable simulation provide gradients with reduced variance but are susceptible to sampling error in scenarios involving stiff dynamics, such as physical contact. This paper investigates the source of this error and introduces Adaptive Horizon Actor-Critic (AHAC), an FO-MBRL algorithm that reduces gradient error by adapting the model-based horizon to avoid stiff dynamics. Empirical findings reveal that AHAC outperforms MFRL baselines, attaining 40% more reward across a set of locomotion tasks and efficiently scaling to high-dimensional control environments with improved wall-clock-time efficiency.
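The sampling-error claim above can be illustrated with a minimal sketch on a one-dimensional toy problem. Everything here is an assumption made for illustration and is not taken from the paper: the piecewise-linear `soft_step` ramp of width `nu`, the Gaussian noise scale `sigma`, and all function names. The sketch compares the mean squared error of a first-order estimate (averaging analytic derivatives) with a zeroth-order estimate (score-function weighting) when the ramp is very steep, which stands in for stiff contact dynamics.

```python
import math
import random


def soft_step(x, nu):
    """Piecewise-linear ramp from 0 to 1 of width nu (assumed form, for illustration)."""
    return min(1.0, max(0.0, x / nu + 0.5))


def soft_step_grad(x, nu):
    """Analytic derivative of the ramp: 1/nu inside the ramp, 0 elsewhere."""
    return 1.0 / nu if abs(x) < nu / 2 else 0.0


def true_grad(theta, nu, sigma):
    """Exact d/dtheta of E_w[soft_step(theta + w)] for w ~ N(0, sigma^2)."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (cdf((nu / 2 - theta) / sigma) - cdf((-nu / 2 - theta) / sigma)) / nu


def fobg(theta, nu, sigma, n, rng):
    """First-order estimate: sample mean of analytic derivatives."""
    return sum(soft_step_grad(theta + rng.gauss(0.0, sigma), nu) for _ in range(n)) / n


def zobg(theta, nu, sigma, n, rng):
    """Zeroth-order estimate: function values weighted by the Gaussian score w / sigma^2."""
    total = 0.0
    for _ in range(n):
        w = rng.gauss(0.0, sigma)
        total += soft_step(theta + w, nu) * w / sigma ** 2
    return total / n


def mse(estimator, theta, nu, sigma, n, trials, seed):
    """Mean squared error of an estimator against the exact gradient."""
    rng = random.Random(seed)
    target = true_grad(theta, nu, sigma)
    errors = [(estimator(theta, nu, sigma, n, rng) - target) ** 2 for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    # A narrow ramp (nu = 0.01) plays the role of a stiff contact event.
    print("FOBG MSE:", mse(fobg, 0.0, 0.01, 1.0, 64, 200, seed=0))
    print("ZOBG MSE:", mse(zobg, 0.0, 0.01, 1.0, 64, 200, seed=0))
```

With a narrow ramp, few samples land inside the steep region, so the first-order estimate is usually zero and occasionally very large. That is one way to picture why a finite batch can give a poor first-order gradient near contact, even though first-order estimates tend to have lower variance on smooth dynamics.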

Paper Structure (20 sections, 2 theorems, 26 equations, 15 figures, 8 tables, 2 algorithms)

Introduction
Preliminaries
Zeroth-Order Batch Gradient (ZOBG) estimates
First-Order Batch Gradient (FOBG) estimates
Policy learning through contact
Adaptive Horizon Actor-Critic (AHAC)
Learning through contact in a single environment
Scaling learning with synchronous parallelization
Experiments
Related work
Conclusion
Heaviside example
Proof of Lemma \ref{lem:bias-bound}
Summary of modifications
AHAC-1 algorithm
...and 5 more sections

Key Result

Lemma 2.5

Under Assumptions ass:dirac-delta and ass:cont-policy, the ZOBG is an unbiased estimator of the stochastic objective: $\mathbb{E}\left[\bar{\nabla}^{[0]} J(\bm{\theta})\right] = \nabla J(\bm{\theta})$, where $\bar{\nabla}^{[0]} J(\bm{\theta})$ is the sample mean of $N$ Mon
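As a reminder of why a zeroth-order estimate of this kind can be unbiased, the standard score-function identity is sketched below for a generic case. The symbols $f$, $x$ and $p_\theta$ are illustrative assumptions and not the paper's notation, and the interchange of gradient and integral is assumed to be permitted; the paper's own assumptions and proof are not reproduced here.

```latex
\begin{align*}
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]
  &= \nabla_\theta \int f(x) \, p_\theta(x) \, dx \\
  &= \int f(x) \, \nabla_\theta p_\theta(x) \, dx \\
  &= \int f(x) \, p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, dx \\
  &= \mathbb{E}_{x \sim p_\theta}\left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]
\end{align*}
```

Because the last line is an expectation, the sample mean of $N$ independent draws of $f(x) \, \nabla_\theta \log p_\theta(x)$ has that expectation as well, which is the sense in which such an estimator is unbiased.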

Figures (15)

Figure 1: Overview. We find that First Order Model-Based RL (FO-MBRL) methods suffer from erroneous gradients arising from stiff dynamics $\left( \left\lVert\nabla f(s,a)\right\rVert \gg 0 \right)$. Our proposed method, AHAC, truncates model-based trajectories at the point of contact, avoiding both the gradient sample error and learning instability exhibited by previous methods using differentiable simulation.
Figure 2: The left figure shows the Soft Heaviside of Eq. \ref{eq:soft-heaviside}. The right figure shows the gradient sample error. We observe that FOBG estimates with finite $N$ exhibit a higher sample error.
Figure 3: Toy example where a ball is shot against a wall, trying to reach the target position in blue. The bottom two figures show gradient sample error and Expected SNR (ESNR) estimation with $N=1024$ samples. Darker shades designate points of contact, which negatively impact FOBG error. Higher ESNR leads to more informative gradients.
Figure 4: Example $H=3$ step trajectory where ${\bm{s}}_3$ is in contact, at which point the trajectory is truncated. When optimizing this trajectory, we completely omit the stiff dynamics gradient $\nabla f({\bm{s}}_2, {\bm{a}}_2)$, leading to more stable and less erroneous FOBGs.
Figure 5: Comparison between SHAC and AHAC-1 on the Hopper task with only a single environment. The figure shows rewards and horizons achieved over 5 different random seeds, with the 50% IQM plotted. Note that both algorithms have some horizon oscillation due to the early termination mechanism of the simulator, as noted in Appendix \ref{app:env-details}.
...and 10 more figures
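The truncation idea in the Figure 4 caption can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name `truncate_at_contact`, the use of a per-step dynamics-gradient norm as the stiffness measure, and the way the threshold is applied are all assumptions. The sketch returns how many leading dynamics gradients would be kept before the first stiff step.

```python
def truncate_at_contact(jac_norms, threshold):
    """Return the number of leading dynamics gradients to keep.

    jac_norms[h] is an assumed stiffness measure for step h, for example the
    norm of the dynamics gradient at (s_h, a_h). The rollout is cut just
    before the first step whose measure exceeds the threshold.
    """
    for step, norm in enumerate(jac_norms):
        if norm > threshold:
            return step  # keep gradients for steps 0 .. step - 1 only
    return len(jac_norms)  # no stiff step found: keep the full horizon


if __name__ == "__main__":
    # Hypothetical H = 3 rollout where the last transition enters contact.
    norms = [1.1, 0.9, 250.0]
    print(truncate_at_contact(norms, threshold=10.0))
```

In this hypothetical rollout the third transition is the stiff one, so only the first two dynamics gradients are retained, mirroring the caption's omission of $\nabla f({\bm{s}}_2, {\bm{a}}_2)$.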

Theorems & Definitions (6)

Definition 2.2
Definition 2.4
Lemma 2.5
Definition 3.1
Lemma 3.2
proof
