Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator

Ryoma Furuyama, Daiki Kuyoshi, Satoshi Yamane

TL;DR

DSQIL tackles data-efficient imitation learning under distribution shift by replacing SQIL's fixed reward with a discriminator-based reward $R = D(s,a)/2$, where the discriminator is trained to distinguish expert from sample transitions. It defines DSQIL as an objective that blends Behavioral Cloning loss with a squared soft Bellman error guided by the discriminator, enabling learning with both discrete and continuous actions. Experiments on MuJoCo tasks (Hopper, Walker2d, HalfCheetah) using SAC show DSQIL matches or surpasses SQIL, especially when expert data is scarce, and reveal dynamic rewards that emphasize expert-like behavior in challenging states. Overall, DSQIL improves data efficiency and robustness to distribution shift, with future work focusing on discriminator accuracy effects and broader evaluation across tasks and reward designs.
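The two ingredients named above, the discriminator-based reward $R = D(s,a)/2$ and the squared soft Bellman error, can be illustrated with a minimal pure-Python sketch for the discrete-action case. This is not the authors' implementation. The function names, the batch layout, the default `gamma` and `alpha`, and the assumption that the discriminator output lies in [0, 1] are all hypothetical. The summary does not say how the reward is applied to expert versus sampled transitions, so the sketch uses the same formula for every transition.

```python
import math

def discriminator_reward(d_out):
    """Reward R = D(s, a) / 2; d_out is assumed to be a discriminator output in [0, 1]."""
    return d_out / 2.0

def soft_value(q_values, alpha=1.0):
    """Soft value V(s) = alpha * log(sum_a exp(Q(s, a) / alpha)) over discrete actions.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def squared_soft_bellman_error(batch, gamma=0.99, alpha=1.0):
    """Mean squared soft Bellman error over a batch.

    Each batch item is a hypothetical tuple (q_sa, d_out, next_q_values):
      q_sa          -- current estimate Q(s, a)
      d_out         -- discriminator output D(s, a)
      next_q_values -- list of Q(s', a') for every action a'
    """
    errors = []
    for q_sa, d_out, next_q_values in batch:
        target = discriminator_reward(d_out) + gamma * soft_value(next_q_values, alpha)
        errors.append((q_sa - target) ** 2)
    return sum(errors) / len(errors)

# Example: a transition whose Q estimate already equals its soft Bellman target
# contributes zero error.
consistent_q = discriminator_reward(1.0) + 0.99 * soft_value([0.0, 0.0])
loss = squared_soft_bellman_error([(consistent_q, 1.0, [0.0, 0.0])])
```

In an actual agent the Q-values would come from a learned network, and with continuous actions (the SAC setting mentioned above) the soft value would be estimated from policy samples rather than an explicit sum over actions.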

Abstract

Imitation learning is often used alongside reinforcement learning in environments where reward design is difficult or rewards are sparse, but it is hard to imitate well in unseen states from a small amount of expert data and sampled data. Supervised learning methods such as Behavioral Cloning do not require sampled data, but usually suffer from distribution shift. Methods based on reinforcement learning, such as inverse reinforcement learning and Generative Adversarial Imitation Learning (GAIL), can learn from only a small amount of expert data. However, they often need to interact with the environment. Soft Q Imitation Learning (SQIL) addressed these problems and was shown to learn efficiently by combining Behavioral Cloning and soft Q-learning with constant rewards. To make this algorithm more robust to distribution shift, we propose a more efficient and robust algorithm that adds to this method a reward function based on adversarial inverse reinforcement learning, which rewards the agent for performing actions in states similar to the demonstrations. We call this algorithm Discriminator Soft Q Imitation Learning (DSQIL). We evaluated it on MuJoCo environments.

Paper Structure (16 sections, 19 equations, 12 figures, 4 tables, 1 algorithm)

Figures (12)

  • Figure 1: Generative Adversarial Network Architecture.
  • Figure 2: Overview of the DSQIL algorithm.
  • Figure 3: Hopper-v3
  • Figure 4: Walker2d-v3
  • Figure 5: HalfCheetah-v3
  • ...and 7 more figures