Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization
Sayambhu Sen, Shalabh Bhatnagar
TL;DR
The paper tackles the sample inefficiency of adversarial imitation learning caused by on-policy training in GAIL. It introduces an off-policy, actor–critic IL framework with a bounded tanh actor, double Q-style stabilization, and reward learning integrated into the critic, enabling stable, data-efficient learning from expert demonstrations without requiring environment rewards. The method achieves expert-level performance on the BipedalWalker-v2 task in roughly 200k environment steps, significantly reducing sample complexity compared to the GAIL baseline, though high variance during training suggests avenues for further stabilization. This approach enhances the practical viability of imitation learning for real-world robotics and control tasks where environment interactions are costly.
Abstract
Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho et al.), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al.). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double Q-network-based stabilization and value learning without reward function inference, we demonstrate a reduction in the samples required to robustly match expert behavior.
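
The two stabilization ingredients named in the summary, a bounded tanh actor and double Q-style targets, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the min-of-two-critics form of the target (one common double Q variant, as used in TD3-style methods), and the scalar `reward` argument standing in for whatever learned signal replaces the environment reward are all assumptions made for illustration.

```python
import math


def tanh_actor(pre_activations, action_scale=1.0):
    """Squash raw actor outputs with tanh so every action component
    lies in [-action_scale, action_scale] (hypothetical helper)."""
    return [action_scale * math.tanh(z) for z in pre_activations]


def double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrapped critic target that uses the smaller of two critic
    estimates for the next state-action pair (assumed variant).

    reward: placeholder for the learned imitation signal, not an
            environment reward.
    done:   1.0 if the transition ended the episode, else 0.0.
    """
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


# Raw outputs of arbitrary magnitude are mapped into the bounded range.
actions = tanh_actor([-5.0, 0.0, 0.3, 12.0])

# The second critic is more optimistic, so the first one sets the target.
target = double_q_target(reward=0.5, done=0.0, q1_next=1.0, q2_next=1.5, gamma=0.9)

# On a terminal transition the bootstrap term is dropped entirely.
terminal_target = double_q_target(reward=0.5, done=1.0, q1_next=1.0, q2_next=1.5)
```

Taking the minimum of two estimates counteracts the overestimation bias that a single bootstrapped critic tends to accumulate under off-policy updates, and the tanh bound keeps actions inside the range a continuous-control environment accepts.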
