A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations

Erhan Can Ozcan; Vittorio Giammarino; James Queeney; Ioannis Ch. Paschalidis

TL;DR

This work addresses RL sample efficiency when expert data contain only states, not actions. It introduces SAC-EO, a variant of Soft Actor-Critic that adds a behavioral cloning term, driven by a forward dynamics model, to the maximum-entropy objective. The augmented loss is $J_\pi(\theta,\phi,\mathcal{D},\mathcal{D}^e,\epsilon)=(1-\epsilon)J_\pi(\theta,\mathcal{D})+\epsilon\,\text{MSE}(\phi,\theta,\mathcal{D}^e)$, and the weight is set adaptively from model reliability as $\epsilon_k=\frac{1}{1+\beta \delta^{max}(\mathcal{D}^e)}$. The forward model predicts next states $\widehat{s}_{t+1}\sim\mathcal{N}(\mu_\phi(s_t,a_t),\Sigma_\phi(s_t,a_t))$, which lets the policy's transitions be aligned with expert trajectories using only observations. Experiments on six DeepMind Control Suite tasks show that SAC-EO learns faster than SAC, MPO, and modified-BCO, often reaching near-expert performance in under $10^6$ steps, and that the adaptive $\epsilon_k$ provides additional gains. The results point to a practical, general framework for leveraging state-only expert data, with potential extensions to other algorithms and to visual observations.
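To make the two formulas in the TL;DR concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the MSE term is assumed to compare the forward model's mean prediction (for the action the policy picks at an expert state) with the next expert state, and $\delta^{max}(\mathcal{D}^e)$ is taken as a given number because this summary does not say how it is computed.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def adaptive_epsilon(delta_max: float, beta: float) -> float:
    """epsilon_k = 1 / (1 + beta * delta_max), as written in the TL;DR.

    delta_max stands in for delta^max(D^e); how it is measured is an
    assumption left to the caller.
    """
    if delta_max < 0.0 or beta < 0.0:
        raise ValueError("delta_max and beta must be non-negative")
    return 1.0 / (1.0 + beta * delta_max)


def expert_state_mse(
    policy_mean: Callable[[Vector], Vector],
    model_mean: Callable[[Vector, Vector], Vector],
    expert_states: Sequence[Vector],
    expert_next_states: Sequence[Vector],
) -> float:
    """Assumed form of MSE(phi, theta, D^e): mean squared gap between the
    model-predicted next state and the observed expert next state."""
    total, count = 0.0, 0
    for state, next_state in zip(expert_states, expert_next_states):
        action = policy_mean(state)             # action chosen by pi_theta
        predicted = model_mean(state, action)   # mu_phi(s_t, a_t)
        for p, target in zip(predicted, next_state):
            total += (p - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("no expert transitions supplied")
    return total / count


def augmented_policy_loss(sac_loss: float, mse: float, epsilon: float) -> float:
    """(1 - epsilon) * J_pi(theta, D) + epsilon * MSE(phi, theta, D^e)."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    return (1.0 - epsilon) * sac_loss + epsilon * mse


# Toy 1-D example (hypothetical dynamics s' = s + a, policy a = -0.5 * s).
states = [[1.0], [2.0]]
next_states = [[0.5], [1.0]]
toy_mse = expert_state_mse(
    lambda s: [-0.5 * s[0]],
    lambda s, a: [s[0] + a[0]],
    states,
    next_states,
)
toy_loss = augmented_policy_loss(2.0, toy_mse, adaptive_epsilon(1.0, 1.0))
```

Under the formula, a larger $\delta^{max}(\mathcal{D}^e)$ gives a smaller $\epsilon_k$, so the expert-matching term is down-weighted when the model is judged less reliable, and $\epsilon_k=1$ only when $\delta^{max}(\mathcal{D}^e)=0$.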

Abstract

This paper investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weights of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.

Paper Structure (10 sections, 11 equations, 1 figure, 3 tables, 1 algorithm)

INTRODUCTION
RELATED WORK
PRELIMINARIES
MAXIMUM ENTROPY POLICY LEARNING WITH EXPERT OBSERVATIONS
Automatic Adjustment of the Expert State Matching Coefficient
Experiments
Performance Comparison on DeepMind Control Suite
The Benefit of Adaptive Epsilon
CONCLUSION
Implementation Details and Hyperparameters

Figures (1)

Figure 1: Comparison of algorithms across tasks. The horizontal blue line represents expert performance. SAC-EO and modified-BCO are supplied with four expert trajectories during training. The scale parameter of SAC-EO is the same across all tasks. Shading denotes one standard error across policies.
