Imitation Learning from Observation with Automatic Discount Scheduling

Yuyang Liu, Weijun Dong, Yingdong Hu, Chuan Wen, Zhao-Heng Yin, Chongjie Zhang, Yang Gao

TL;DR

The paper tackles Imitation Learning from Observation (ILfO), where demonstrations lack actions, and identifies a progress-dependency challenge: rewards assigned to later stages hinder the learning of early behaviors. It proposes Automatic Discount Scheduling (ADS), which adjusts the discount factor during training, guided by a progress recognizer based on the longest increasing subsequence (LIS), so that the agent imitates early behaviors before later ones. Combining ADS with Optimal Transport-based proxy rewards over observation trajectories yields strong improvements across nine Meta-World manipulation tasks, especially on harder tasks where progress dependency is critical. ADS performs robustly and adapts across tasks, offering a practical path to leveraging unlabeled video demonstrations for complex manipulation.
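To make the idea of an LIS-based progress recognizer concrete, here is a minimal sketch. The summary only states that the recognizer is LIS-based, so the details are assumptions made for illustration: each agent frame is matched to its nearest expert frame by squared Euclidean distance in some feature space, and progress is taken as the length of the longest strictly increasing subsequence of the matched expert indices. The function names are hypothetical.

```python
from bisect import bisect_left

import numpy as np


def nearest_expert_indices(agent_feats, expert_feats):
    """For each agent frame, return the index of the closest expert frame.

    Assumption: frames are compared by squared Euclidean distance between
    feature vectors of shape (T, D).
    """
    diffs = agent_feats[:, None, :] - expert_feats[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)  # shape (T_agent, T_expert)
    return dists.argmin(axis=1)


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def progress(agent_feats, expert_feats):
    """Hypothetical progress score: how far the agent follows the expert in order."""
    idx = nearest_expert_indices(agent_feats, expert_feats)
    return lis_length(idx.tolist())
```

Under these assumptions, an agent trajectory that passes through the expert's frames in order scores as high as the demonstration length, while a trajectory that stalls near the first few expert frames scores low.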

Abstract

Humans often acquire new skills through observation and imitation. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into inverse reinforcement learning problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in reinforcement learning during the training phase, prioritizing earlier rewards initially and gradually engaging later rewards only when the earlier behaviors have been mastered. Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.
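The scheduling idea in the abstract can be illustrated with a small sketch. With discount factor gamma, a reward at step t contributes with weight gamma ** t, so a small gamma effectively hides later rewards and a larger gamma brings them in. The mapping below from a progress estimate k to a discount factor is an assumption for illustration, not a rule stated in this summary: it picks gamma so that the reward at step k keeps a fixed weight `alpha`, capped at `gamma_max`. The names and the default values of `alpha` and `gamma_max` are hypothetical.

```python
def reward_weight(gamma, t):
    """Weight that a reward received at step t carries in the discounted return."""
    return gamma ** t


def discount_for_progress(k, alpha=0.2, gamma_max=0.99):
    """Hypothetical schedule: choose gamma so that gamma ** k == alpha.

    k         -- current progress estimate (number of expert steps mastered)
    alpha     -- assumed weight retained by the reward at step k
    gamma_max -- assumed upper bound on the discount factor
    """
    k = max(int(k), 1)
    return min(alpha ** (1.0 / k), gamma_max)


if __name__ == "__main__":
    for k in (1, 10, 50, 100):
        gamma = discount_for_progress(k)
        print(k, round(gamma, 4), round(reward_weight(gamma, 100), 4))
```

In this sketch the discount factor grows as the measured progress grows, so rewards from later steps gain influence only after the agent reproduces the earlier part of the demonstration.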

Paper Structure (27 sections, 5 equations, 16 figures, 1 table, 1 algorithm)

Figures (16)

  • Figure 1: An example of employing proxy-reward-based ILfO methods on a task with progress dependency. For the task basketball, (a) taking only the initial part of the expert demonstration as the imitation objective, the agent efficiently acquires the grasping skill; (b) taking the entire expert demonstration as the imitation objective, the agent fails to grasp the ball and instead sweeps it away.
  • Figure 2: (a) The agent learns a suboptimal policy that sweeps the ball away. (b) The agent can also collect explorative trajectories that successfully lift the ball to a certain height, but it still fails to acquire this skill.
  • Figure 3: Evaluation of ILfO methods on 9 Meta-World tasks (2 million environment frames). Each curve reports the mean and standard deviation over 8 random seeds.
  • Figure 4: Comparing OT+ADS against OT equipped with a fixed discount factor (1 million environment frames).
  • Figure 5: Comparing OT+ADS against OT equipped with an exponential discount schedule (1 million environment frames). For the baselines, the discount factor increases exponentially from $0.9$ to $0.99$ within 0.5 or 1 million environment frames.
  • ...and 11 more figures
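
The exponential discount schedule used as a baseline in Figure 5 can be sketched as follows. The caption gives only the endpoints ($0.9$ to $0.99$) and the duration (0.5 or 1 million environment frames). The interpolation rule here is an assumption: it interpolates 1 - gamma geometrically, which makes the effective horizon 1 / (1 - gamma) grow exponentially from 10 to 100 steps. Interpolating gamma itself geometrically would be another reading. The function name is hypothetical.

```python
def exponential_discount(frame, total_frames, gamma_start=0.9, gamma_end=0.99):
    """Discount factor after `frame` environment frames.

    Assumption: 1 - gamma is interpolated geometrically between its start and
    end values, and the discount stays at gamma_end once total_frames is reached.
    """
    frac = min(max(frame / total_frames, 0.0), 1.0)
    one_minus_gamma = (1.0 - gamma_start) ** (1.0 - frac) * (1.0 - gamma_end) ** frac
    return 1.0 - one_minus_gamma


if __name__ == "__main__":
    for frame in (0, 250_000, 500_000, 1_000_000):
        print(frame, round(exponential_discount(frame, 500_000), 4))
```

Unlike ADS, such a schedule depends only on the frame count and ignores how much of the demonstration the agent has actually learned to reproduce.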