A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations

Erhan Can Ozcan, Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis

TL;DR

This work tackles RL sample efficiency when expert data provide only states rather than actions. It introduces SAC-EO, a variant of Soft Actor-Critic that adds a forward-dynamics–driven behavioral cloning term to the maximum-entropy objective and uses an adaptive weighting scheme based on model reliability, with the augmented loss $J_\pi(\theta,\phi,\mathcal{D},\mathcal{D}^e,\epsilon)=(1-\epsilon)J_\pi(\theta,\mathcal{D})+\epsilon\text{MSE}(\phi,\theta,\mathcal{D}^e)$ and $\epsilon_k=\frac{1}{1+\beta \delta^{max}(\mathcal{D}^e)}$. The forward model predicts next states $\widehat{s}_{t+1}\sim\mathcal{N}(\mu_\phi(s_t,a_t),\Sigma_\phi(s_t,a_t))$ to align policy transitions with expert trajectories using only observations. Experiments on six DeepMind Control Suite tasks show that SAC-EO dramatically speeds up learning, often reaching near-expert performance in under $10^6$ steps and outperforming SAC, MPO, and modified-BCO, with adaptive $\epsilon_k$ providing additional gains. The results demonstrate a practical, general framework for leveraging state-only expert data, with potential extensions to other algorithms and visual observations.
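The weighting scheme in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $\delta^{max}(\mathcal{D}^e)$ is a non-negative scalar measuring forward-model error on the expert data, and that the behavioral cloning term is a mean squared error between the model's predicted next states and the expert's next states. The function names (`adaptive_weight`, `bc_mse`, `augmented_policy_loss`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def adaptive_weight(delta_max: float, beta: float) -> float:
    """Return epsilon_k = 1 / (1 + beta * delta_max).

    delta_max is assumed to be a non-negative model-error measure on the
    expert data; beta is the scale parameter. A larger error gives a
    smaller weight on the behavioral cloning term.
    """
    if delta_max < 0 or beta < 0:
        raise ValueError("delta_max and beta must be non-negative")
    return 1.0 / (1.0 + beta * delta_max)


def bc_mse(predicted_next_states: np.ndarray, expert_next_states: np.ndarray) -> float:
    """Mean squared error between predicted and expert next states.

    Assumption: the predictions are the forward model's mean outputs
    mu_phi(s_t, a_t), with a_t drawn from the current policy.
    """
    diff = np.asarray(predicted_next_states) - np.asarray(expert_next_states)
    return float(np.mean(diff ** 2))


def augmented_policy_loss(sac_loss: float, mse: float, epsilon: float) -> float:
    """Blend the two terms: (1 - epsilon) * J_pi + epsilon * MSE."""
    return (1.0 - epsilon) * sac_loss + epsilon * mse


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    eps = adaptive_weight(delta_max=1.0, beta=1.0)
    mse = bc_mse(np.array([[1.0, 2.0]]), np.array([[1.0, 4.0]]))
    print(eps, mse, augmented_policy_loss(sac_loss=2.0, mse=mse, epsilon=eps))
```

With a perfectly reliable model (`delta_max = 0`) the weight is 1 and the loss reduces to the behavioral cloning term; as the model error grows, the weight tends to 0 and the loss approaches the plain maximum-entropy objective.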

Abstract

This paper investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weights of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.

Paper Structure

This paper contains 10 sections, 11 equations, 1 figure, 3 tables, and 1 algorithm.

Figures (1)

  • Figure 1: Comparison of algorithms across tasks. The horizontal blue line represents expert performance. SAC-EO and modified-BCO are supplied with four expert trajectories during training. The scale parameter of SAC-EO is the same across all tasks. Shading denotes one standard error across policies.