VIEW: Visual Imitation Learning with Waypoints

Ananth Jonnavittula, Sagar Parekh, Dylan P. Losey

TL;DR

VIEW tackles the challenge of learning robot manipulation from a single human video by compressing the demonstration into a concise, task-focused prior of waypoints and aligning robot-object motion via agent-agnostic rewards. It introduces a two-phase exploration (grasp and task) within a quality-diversity framework, a residual network to de-noise priors across tasks, and an extension to multi-object and long-horizon scenarios. The approach yields substantial gains in sample efficiency, learning many tasks in under 30 minutes with fewer than 20 real-world rollouts, and performs strongly in both simulated and real-world experiments, including cluttered environments. These results suggest VIEW provides a practical, scalable bridge from human video demonstrations to efficient, adaptable robot policies, with clear avenues for orientation-aware and pose-based extensions.

Abstract

Robots can use Visual Imitation Learning (VIL) to learn manipulation tasks from video demonstrations. However, translating visual observations into actionable robot policies is challenging due to the high-dimensional nature of video data. This challenge is further exacerbated by the morphological differences between humans and robots, especially when the video demonstrations feature humans performing tasks. To address these problems we introduce Visual Imitation lEarning with Waypoints (VIEW), an algorithm that significantly enhances the sample efficiency of human-to-robot VIL. VIEW achieves this efficiency using a multi-pronged approach: extracting a condensed prior trajectory that captures the demonstrator's intent, employing an agent-agnostic reward function for feedback on the robot's actions, and utilizing an exploration algorithm that efficiently samples around waypoints in the extracted trajectory. VIEW also segments the human trajectory into grasp and task phases to further accelerate learning efficiency. Through comprehensive simulations and real-world experiments, VIEW demonstrates improved performance compared to current state-of-the-art VIL methods. VIEW enables robots to learn manipulation tasks involving multiple objects from arbitrarily long video demonstrations. Additionally, it can learn standard manipulation tasks such as pushing or moving objects from a single video demonstration in under 30 minutes, with fewer than 20 real-world rollouts. Code and videos here: https://collab.me.vt.edu/view/

Paper Structure

This paper contains 21 sections, 6 equations, 14 figures.

Figures (14)

  • Figure 1: Robot learning from visual demonstration. 1) A human demonstrates the task directly in the environment: here we use the example of picking up a cup. Under our proposed approach, the robot processes a single video of that demonstration to selectively focus on important features such as the human hand and the manipulated object. From these trajectories the robot obtains waypoints that capture the critical parts of the task (e.g., grasping the cup). 2) These extracted waypoints serve as a prior for the correct robot trajectory. 3) In practice, simply executing this prior rarely leads to task success, due in part to the morphological differences between human and robot (in this case, the robot misses the cup entirely). Therefore, the robot must explore in a region around the initial waypoints to iteratively improve its trajectory. 4) After repeatedly interacting with the environment, the robot learns to successfully imitate the behavior demonstrated in the human video.
  • Figure 2: Outline of VIEW, our proposed method for human-to-robot visual imitation learning. (Top Left) VIEW begins with a single video demonstration of a task. (Bottom Left) From this video we extract the object of interest, its trajectory, and the human's hand trajectory. (Middle) We then perform compression to obtain a trajectory prior: a sequence of waypoints the robot arm should interpolate between to complete the task. Unfortunately, this initial trajectory is often imprecise due to the differences between human hands and robot grippers, as well as noise in the extraction process. We therefore refine the prior using a residual network, which is trained on previous tasks to de-noise the current data. (Right) The de-noised trajectory is then segmented into two phases: grasp exploration and task exploration. (Top Right) During grasp exploration, the robot determines how to pick up the object by modifying the pick point in its trajectory. (Bottom Right) Following a successful grasp, the robot proceeds to task exploration, where it simultaneously corrects the remaining waypoints of the trajectory. After completing exploration, the robot synthesizes a complete trajectory. (Middle) This solved trajectory, alongside the prior trajectory, is used to further train the residual network, thus enhancing the performance of our method in future tasks.
  • Figure 3: An overview of our prior extraction method (Bottom Left in Figure 2). Utilizing the 100 Days of Hands (100DOH) detector [shan2020understanding], we first identify the location of the hand and whether it is in contact with any objects present in the frame. We then refine the human's hand trajectory using the MANO model [romero2022embodied] to capture wrist movements. Subsequently, to eliminate redundancy, we apply the SQUISHE algorithm [muckell2014compression]. This produces an initial trajectory with key waypoints that the robot should interpolate between. To pinpoint the object of interest amidst potential clutter, we analyze frames where hand-object contact occurs, creating anchor boxes that, in conjunction with an object detector, reveal the object the human interacts with most frequently. This identification enables us to construct an accurate object trajectory from the human's video.
  • Figure 4: Generating a bounding box for exploring grasp locations. We define a region around the waypoint $\omega^h_{grasp} = (x, y, z)$ where the human first interacted with the object in the video demonstration. (Top) A naive approach: the bounding box is centered around $\omega^h_{grasp}$ with limits $\Delta$. The principal diagonal of the bounding box is defined by $(x - \Delta, y - \Delta, z - \Delta)$ and $(x + \Delta, y + \Delta, z + \Delta)$. (Bottom) Our approach, which leverages the estimated object location $\omega^o_{grasp}$ at the time of grasping to bias the search space. The endpoints of the bounding box's principal diagonal are $\omega^h_{grasp} + \Delta \hat{j}$ and $\omega^o_{grasp} - \Delta \hat{j}$, where $\hat{j}$ is the unit vector parallel to the principal diagonal. This bounding box is typically smaller and is more likely to include an effective grasp location for the robot.
  • Figure 5: Comparison of different sampling methods in our high-level grasp exploration. We show an example task in a two-dimensional space which is bounded around the prior (black triangle) and the object (green star). (Top) Each new high-level waypoint is sampled uniformly at random from our set of unvisited waypoints. This method can eventually reach the object with sufficient exploration. However, new samples may be close to previously tested points. (Middle) To quickly reduce the uncertainty about the unknown object location, we can sample high-level waypoints that maximize the distance to all previously visited waypoints. We expect that these waypoints will explore new regions of the search space. In practice, however, the distance-based estimation from Equation (\ref{eq:distance_metric}) results in points that are clustered at the corners and center. (Bottom) Our proposed solution is to add a regularizing term in Equation (\ref{eq:sampling_strategy}) to ensure that the next high-level waypoint is truly from an unexplored region of the workspace. Our experiments show that this approach finds the grasp location more rapidly.
  • ...and 9 more figures
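
The grasp-exploration geometry in the captions of Figures 4 and 5 can be illustrated with a short Python sketch. It is not the authors' implementation. All function names are hypothetical. The direction of $\hat{j}$ is assumed to point from the object estimate $\omega^o_{grasp}$ toward the human waypoint $\omega^h_{grasp}$, so that both diagonal endpoints are pushed outward. The distance-based picker uses a plain max-min distance rule as a stand-in, because the paper's distance metric and regularizing term are not reproduced in this summary.

```python
import math
import random


def naive_box(w_h, delta):
    """Figure 4 (top): axis-aligned box centered on the human grasp waypoint."""
    lo = tuple(c - delta for c in w_h)
    hi = tuple(c + delta for c in w_h)
    return lo, hi


def biased_box(w_h, w_o, delta):
    """Figure 4 (bottom): box whose principal diagonal runs from
    w_h + delta * j to w_o - delta * j.

    Assumption: j is the unit vector from the object estimate w_o
    toward the human waypoint w_h.
    """
    diff = tuple(h - o for h, o in zip(w_h, w_o))
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        # Direction undefined when both estimates coincide; this fallback
        # to the naive box is our own choice, not taken from the paper.
        return naive_box(w_h, delta)
    j = tuple(d / norm for d in diff)
    a = tuple(h + delta * u for h, u in zip(w_h, j))
    b = tuple(o - delta * u for o, u in zip(w_o, j))
    lo = tuple(min(p, q) for p, q in zip(a, b))
    hi = tuple(max(p, q) for p, q in zip(a, b))
    return lo, hi


def volume(box):
    """Volume of an axis-aligned box given as (lo, hi) corners."""
    lo, hi = box
    return math.prod(h - l for l, h in zip(lo, hi))


def pick_uniform(unvisited, rng):
    """Figure 5 (top): choose the next waypoint uniformly at random."""
    return rng.choice(unvisited)


def pick_farthest(unvisited, visited):
    """Stand-in for Figure 5 (middle): choose the unvisited waypoint whose
    nearest visited waypoint is as far away as possible (max-min distance).
    The regularized rule of Figure 5 (bottom) is not implemented here.
    """
    return max(unvisited, key=lambda p: min(math.dist(p, v) for v in visited))


if __name__ == "__main__":
    # Illustrative coordinates in meters; these values are made up.
    w_h = (0.50, 0.10, 0.30)
    w_o = (0.48, 0.08, 0.27)
    print("naive volume :", volume(naive_box(w_h, 0.05)))
    print("biased volume:", volume(biased_box(w_h, w_o, 0.05)))
    rng = random.Random(0)
    grid = [(x / 4, y / 4) for x in range(5) for y in range(5)]
    print("uniform pick :", pick_uniform(grid, rng))
    print("farthest pick:", pick_farthest(grid[1:], [grid[0]]))
```

Under the assumed direction of $\hat{j}$, the biased box always contains both $\omega^h_{grasp}$ and $\omega^o_{grasp}$. Its size scales with the distance between the two estimates, so it is smaller than the naive box only when the hand and object estimates are close relative to $\Delta$.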