A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations
Erhan Can Ozcan, Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis
TL;DR
This work tackles RL sample efficiency when expert data contain only states, not actions. It introduces SAC-EO, a variant of Soft Actor-Critic that adds a behavioral cloning term driven by a forward dynamics model to the maximum-entropy objective, and weights that term adaptively according to model reliability. The augmented loss is $J_\pi(\theta,\phi,\mathcal{D},\mathcal{D}^e,\epsilon)=(1-\epsilon)\,J_\pi(\theta,\mathcal{D})+\epsilon\,\text{MSE}(\phi,\theta,\mathcal{D}^e)$, with weight $\epsilon_k=\frac{1}{1+\beta\,\delta^{\max}(\mathcal{D}^e)}$. The forward model predicts next states $\widehat{s}_{t+1}\sim\mathcal{N}(\mu_\phi(s_t,a_t),\Sigma_\phi(s_t,a_t))$, which lets the policy's transitions be aligned with expert trajectories using only observations. Experiments on six DeepMind Control Suite tasks show that SAC-EO speeds up learning substantially, often reaching near-expert performance in under $10^6$ steps and outperforming SAC, MPO, and modified-BCO; the adaptive $\epsilon_k$ provides additional gains. The results demonstrate a practical, general framework for leveraging state-only expert data, with potential extensions to other algorithms and visual observations.
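To make the weighting rule and the augmented loss stated above concrete, here is a minimal Python sketch. It only evaluates the two formulas on scalar inputs. The function names `adaptive_weight` and `augmented_policy_loss` are hypothetical, and the sketch assumes that $\delta^{\max}(\mathcal{D}^e)$ is a non-negative scalar measuring forward-model error on the expert data and that $\beta \ge 0$; how $\delta^{\max}$ is computed is not specified in this summary.

```python
def adaptive_weight(delta_max: float, beta: float) -> float:
    """Evaluate epsilon_k = 1 / (1 + beta * delta_max).

    Assumption: delta_max >= 0 is some model-error measure on the expert
    data and beta >= 0 is a sensitivity hyperparameter.
    """
    if delta_max < 0 or beta < 0:
        raise ValueError("delta_max and beta are assumed to be non-negative")
    return 1.0 / (1.0 + beta * delta_max)


def augmented_policy_loss(sac_loss: float, bc_mse: float, epsilon: float) -> float:
    """Evaluate (1 - epsilon) * J_pi + epsilon * MSE as a convex combination."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    return (1.0 - epsilon) * sac_loss + epsilon * bc_mse


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show how the weight reacts.
    for delta in (0.0, 0.5, 5.0):
        eps = adaptive_weight(delta_max=delta, beta=2.0)
        loss = augmented_policy_loss(sac_loss=1.0, bc_mse=3.0, epsilon=eps)
        print(f"delta_max={delta:.1f}  epsilon={eps:.3f}  loss={loss:.3f}")
```

Under these assumptions, a perfectly reliable model ($\delta^{\max}=0$) gives $\epsilon_k=1$, so the cloning term dominates, while a large model error pushes $\epsilon_k$ toward 0 and the loss falls back to the standard SAC objective.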
Abstract
This paper investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weight of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.
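One way to picture a behavioral cloning loss that needs only expert observations is sketched below in plain Python. This is an illustrative reading, not the paper's implementation: it assumes the loss is the mean squared error between the expert's next state and the forward model's predicted next state when the current policy acts in the expert's state. For simplicity it uses a deterministic policy and the model's mean prediction; the names `observation_bc_mse`, `policy`, and `model_mean` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = List[float]
Action = List[float]


def observation_bc_mse(
    policy: Callable[[State], Action],
    model_mean: Callable[[State, Action], State],
    expert_pairs: Sequence[Tuple[State, State]],
) -> float:
    """Mean squared error between predicted and expert next states.

    expert_pairs holds (s_t, s_{t+1}) tuples; no expert actions are needed.
    Assumption: the model's mean prediction stands in for a sampled next state.
    """
    if not expert_pairs:
        raise ValueError("expert_pairs must not be empty")
    total, count = 0.0, 0
    for state, expert_next in expert_pairs:
        action = policy(state)                 # action the policy would take
        predicted = model_mean(state, action)  # where the model says it leads
        for p, e in zip(predicted, expert_next):
            total += (p - e) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy 1-D dynamics s' = s + a, standing in for a learned forward model.
    def toy_model(s: State, a: Action) -> State:
        return [si + ai for si, ai in zip(s, a)]

    expert = [([0.0], [1.0]), ([1.0], [2.0])]  # expert always moves by +1
    print(observation_bc_mse(lambda s: [1.0], toy_model, expert))  # matches expert
    print(observation_bc_mse(lambda s: [0.0], toy_model, expert))  # stays put
```

In this toy setting, a policy that reproduces the expert's state transitions incurs zero loss, which is the sense in which the forward model substitutes for the missing expert actions.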
