Single-Reset Divide & Conquer Imitation Learning

Alexandre Chenu, Olivier Serris, Olivier Sigaud, Nicolas Perrin-Gilbert

TL;DR

SR-DCIL investigates learning control policies under a weaker reset assumption by extending DCIL-II to a single initial reset. It introduces three mechanisms, Demo-Buffer (DB), Value Cloning (VC), and Approximate Goal Switching (AGS), to guide learning and enable training for distant goals. The authors provide a detailed algorithm and ablations on two robotic tasks (Dubins Maze and Fetch) showing that DB offers stronger guidance in low-dimensional settings and that AGS improves distant-goal efficiency, with mixed results in high-dimensional tasks. This work highlights the reset assumption's critical role in DCIL and lays a foundation for more versatile imitation-learning methods that operate with minimal resets.

Abstract

Demonstrations are commonly used to speed up the learning process of Deep Reinforcement Learning algorithms. To cope with the difficulty of accessing multiple demonstrations, some algorithms have been developed to learn from a single demonstration. In particular, the Divide & Conquer Imitation Learning algorithms leverage a sequential bias to learn a control policy for complex robotic tasks using a single state-based demonstration. The latest version, DCIL-II, demonstrates remarkable sample efficiency. This novel method operates within an extended Goal-Conditioned Reinforcement Learning framework, ensuring compatibility between intermediate and subsequent goals extracted from the demonstration. However, a fundamental limitation arises from the assumption that the system can be reset to specific states along the demonstrated trajectory, confining the application to simulated systems. In response, we introduce an extension called Single-Reset DCIL (SR-DCIL), designed to overcome this constraint by relying on a single initial state reset rather than sequential resets. To address this more challenging setting, we integrate two mechanisms inspired by the Learning from Demonstrations literature, a Demo-Buffer and Value Cloning, to guide the agent toward compatible success states. In addition, we introduce Approximate Goal Switching to facilitate training to reach goals distant from the reset state. Our paper makes several contributions: highlighting the importance of the reset assumption in DCIL-II, presenting the mechanisms of SR-DCIL variants, and evaluating their performance in challenging robotic tasks compared to DCIL-II. In summary, this work offers insights into the significance of reset assumptions in the framework of DCIL and proposes SR-DCIL, a first step toward a versatile algorithm capable of learning control policies under a weaker reset assumption.

Paper Structure (28 sections, 7 equations, 5 figures, 1 algorithm)

Figures (5)

  • Figure 1: Limitations caused by a reset to a single state. The agent needs to reach $g_{1}$ to train for $g_{2}$ and to reach both $g_{1}$ and $g_{2}$ to train for $g_{3}$. Moreover, to successively reach each goal, the agent must transition between successive sets of valid success states (in pink). Until the agent has learned how to reach the second goal, the values of valid and invalid success states associated with the first goal are similar ($V_{\theta}(s,g_{1},1) \approx 1, \forall s \in \mathcal{S}_{g_{1}}$). Therefore, the agent is not encouraged to target valid success states. This results in a large number of wasted training trajectories (in red), as they are launched from invalid success states. When the agent mostly reaches invalid success states (as for goal $g_{2}$ here, which has a small set of valid success states), training for the next goal can become very challenging, as most training rollouts for $g_{3}$ start from incompatible states.
  • Figure 2: Visualising the impact of the Demo-Buffer (DB) and Value Cloning (VC) after 15k training steps. When SR-DCIL is not equipped with a DB or VC (SR-DCIL w/o DB & VC), the agent is not guided to valid success states by an increase in the Q-value of demonstrated state-action pairs (as in SR-DCIL w/ DB) or an increase in the value of demonstrated states (as in SR-DCIL w/ VC). As a result, it fails to achieve $g_{1}$ via valid success states and cannot reach $g_{2}$. In contrast, SR-DCIL w/ DB and SR-DCIL w/ VC encourage the agent to achieve $g_{1}$ by reaching valid success states and manage to reach $g_{2}$.
  • Figure 3: Illustration of the Approximate Goal Switching (AGS) concept in a toy 2D maze where the agent corresponds to a Dubins car (Dubins, 1957) with $(x,y,\theta)$ states and $(x,y)$ goals. The contours represent the maximum of the value function at each $(x,y)$ position, obtained by uniformly sampling 20 orientations, after 15k SR-DCIL training steps. In the green trajectory, the agent triggers AGS by entering the blue zone. $k$ steps later, the goal is automatically switched to the next one in $\tau_{\mathcal{G}}$, and the agent can continue its progression in the maze. In contrast, in the red trajectory w/o AGS, after an irrecoverable narrow miss of the first goal, the agent is still conditioned on this goal and eventually collides with the wall while trying to turn toward it (for better readability, we drop the index and the goal in the Q-function entry when they are not necessary).
  • Figure 4: Ablation study. Comparing different variants of SR-DCIL in the Dubins Maze and the Fetch environments: we evaluate the success rates of SR-DCIL with two different mechanisms (DB and VC) that encourage the agent to reach each goal via valid success states. Both variants are evaluated with and without AGS. In addition, we evaluate a vanilla version of SR-DCIL without AGS, VC, or DB, which is equivalent to DCIL-II with a reset to a single state. The mean and standard deviation are computed over 10 seeds. The standard deviation is divided by two for better visualization.
  • Figure 5: Success and failure modes of VC. In the Fetch environment, $35\%$ of the runs fail to learn how to grasp the object. As expected, the Value Cloning mechanism sets the value of the demonstrated states to their theoretical value. Here, in the two selected runs, the evolution of the learned value $V(s)$ of the last demonstrated state before the complex grasping behavior is plotted in red and matches the theoretical value in black. In the successful run (top panel), the agent visits states similar to the demonstrated states. Therefore, the high value is propagated to the Q-value (on-policy Q-value of the last state before grasping, in blue), which impacts the policy and guides the agent toward valid success states. However, in the failing run (bottom panel), those states are rarely visited by the agent during training. Therefore, their high value has little to no impact on the Q-value (on-policy Q-value of the last state before grasping, in blue) or on the policy. As a result, the agent is not guided toward valid success states.
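
The contour map described in the Figure 3 caption (maximum of the value function at each $(x,y)$ position over 20 uniformly sampled orientations) can be sketched as follows. This is only an illustration: `toy_value` is a hypothetical stand-in for the learned value function, the grid bounds are arbitrary, and "uniformly sampling" is read here as random uniform draws in $[-\pi, \pi)$ (an evenly spaced grid of orientations would be another reasonable reading).

```python
import numpy as np


def max_value_over_orientations(value_fn, xs, ys, n_orientations=20, seed=0):
    """For each (x, y) cell, sample orientations uniformly at random and
    keep the largest value returned by value_fn(x, y, theta)."""
    rng = np.random.default_rng(seed)
    grid = np.empty((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            thetas = rng.uniform(-np.pi, np.pi, size=n_orientations)
            grid[i, j] = max(value_fn(x, y, t) for t in thetas)
    return grid


def toy_value(x, y, theta, goal=(1.0, 1.0)):
    """Hypothetical value function (NOT the learned critic): high when the
    car is close to the goal and heading toward it."""
    dist = np.hypot(goal[0] - x, goal[1] - y)
    heading_to_goal = np.arctan2(goal[1] - y, goal[0] - x)
    alignment = 0.5 * (1.0 + np.cos(theta - heading_to_goal))  # in [0, 1]
    return np.exp(-dist) * alignment


# Arbitrary 5x5 grid over a hypothetical 2m x 2m maze.
xs = np.linspace(0.0, 2.0, 5)
ys = np.linspace(0.0, 2.0, 5)
grid = max_value_over_orientations(toy_value, xs, ys)
```

The resulting `grid` could then be passed to a contour-plotting routine. Taking the maximum over orientations collapses the 3D state $(x,y,\theta)$ to a 2D map while keeping the most favourable heading at each position.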
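
The goal-switching behavior described in the Figure 3 caption (AGS is triggered when the agent enters a zone around the goal, and the goal is switched $k$ steps later) can be illustrated with a toy sketch. The class `GoalSwitcher`, both radii, and the distance-based definition of the trigger zone are assumptions made for illustration; the caption does not specify how the blue zone is defined, and this is not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GoalSwitcher:
    goals: List[Tuple[float, float]]  # sequence of (x, y) goals, like tau_G
    success_radius: float   # assumed: goal is reached within this distance
    trigger_radius: float   # assumed: larger zone that arms the switch
    k: int                  # steps between the trigger and the switch
    index: int = 0
    countdown: Optional[int] = None

    def _advance(self):
        self.index += 1
        self.countdown = None

    def step(self, xy):
        """Update the current goal index from the agent's (x, y) position."""
        if self.index >= len(self.goals):
            return self.index  # every goal has already been reached
        gx, gy = self.goals[self.index]
        dist = math.hypot(xy[0] - gx, xy[1] - gy)
        if dist <= self.success_radius:
            self._advance()  # exact success: switch immediately
        elif self.countdown is not None:
            self.countdown -= 1
            if self.countdown <= 0:
                self._advance()  # approximate switch, k steps after trigger
        elif dist <= self.trigger_radius and self.index < len(self.goals) - 1:
            # Arm the switch; the final goal must actually be reached.
            self.countdown = self.k
        return self.index


def run(trigger_radius, positions):
    switcher = GoalSwitcher(goals=[(1.0, 0.0), (2.0, 0.0)],
                            success_radius=0.1,
                            trigger_radius=trigger_radius, k=2)
    return [switcher.step(p) for p in positions]


# A trajectory that narrowly misses the first goal, then reaches the second.
positions = [(0.0, 0.0), (0.6, 0.0), (0.7, 0.3), (0.8, 0.5), (2.0, 0.05)]
trace_with_ags = run(trigger_radius=0.5, positions=positions)
trace_without_ags = run(trigger_radius=0.0, positions=positions)
```

With the trigger zone enabled, the goal index advances despite the narrow miss; with `trigger_radius=0.0` the agent stays conditioned on the first goal for the whole trajectory, which mirrors the red trajectory in the figure.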