RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation

Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, Jon Scholz

TL;DR

RoboTAP tackles the challenge of quickly teaching robots new visuomotor skills with minimal data by leveraging dense point tracking via TAPIR to extract actionable motion from demonstrations. It factorizes perception and control into what, where, and how, building a motion plan from few demonstrations and executing it with a robust 4-DoF visual-servoing controller. The paper introduces Online TAPIR for real-time operation, presents a new RoboTAP dataset of real-world robotic videos with ground-truth point tracks, and demonstrates complex manipulation tasks with mm-scale precision and strong robustness to clutter. While less general than fully end-to-end models, RoboTAP offers a data-efficient, interpretable framework that can be integrated with larger systems to enable scalable, real-world robotic manipulation.

Abstract

For robots to be useful outside labs and specialized factories we need a way to teach them new useful behaviors quickly. Current approaches lack either the generality to onboard new tasks without task-specific engineering, or else lack the data-efficiency to do so in an amount of time that enables practical use. In this work we explore dense tracking as a representational vehicle to allow faster and more general learning from demonstration. Our approach utilizes Track-Any-Point (TAP) models to isolate the relevant motion in a demonstration, and parameterize a low-level controller to reproduce this motion across changes in the scene configuration. We show this results in robust robot policies that can solve complex object-arrangement tasks such as shape-matching, stacking, and even full path-following tasks such as applying glue and sticking objects together, all from demonstrations that can be collected in minutes.

Paper Structure (24 sections, 11 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: An example of RoboTAP using points automatically selected from few demos ($\leq 6$) to define a long horizon behaviour. First row is illustrative, second row is what the agent sees. At every stage, the system identifies the current location of "active" points relevant to the stage (red). Given the goal locations for each point, from the demos (cyan), a desired motion for each point is produced (blue lines), and converted to a robot action using a generalized 4D visual-servoing primitive, which operates with arbitrary points.
  • Figure 2: Here we describe the core of the RoboTAP approach. Given a set of demonstrations $D$, we first track densely using TAPIR. Next, we temporally segment the demonstrations into stages, and then automatically discover the active point set $q$ for each stage, which covers the object whose motion is relevant at that stage of the action. We then form a motion plan that can be executed on the robot, which consists of stages of servoing to imitate visual motions, and basic motor primitives like closing and opening the gripper. Visual servoing is accomplished by detecting points $q$ using TAPIR, finding the nearest demonstration which shows how those points should move, and finding a single nearby frame that can be used as a motion target. The displacement between the points in the target frame ($g$) and the online TAPIR detections is used as a motion target for classical visual servoing, yielding surprisingly complex and robust behavior.
  • Figure 3: Active point selection. We exploit the funneling nature of control to identify relevant points based on their variance, and remove the gripper by filtering static points. Remaining points are used to vote on the motion cluster, which we use to sample 128 points throughout the motion segment to serve as the salient features for that step of the motion plan.
  • Figure 4: Examples of successfully solved real robot tasks. In order to challenge the system and demonstrate its robustness we show its performance on scenes with clutter, distracting objects and partial occlusions.
  • Figure 5: Examples of states where the system failed to reach the desired final state. Tasks which require sub-5 mm precision cannot always be reliably solved (e.g. shape-matching). In addition, our use of a purely visual control paradigm makes it difficult to solve tasks that require reasoning over visual and force modalities simultaneously (e.g. LEGO part-mating). Lastly, our controller is unable to reason about the validity of a motion plan at runtime, which can lead to failures if certain motions are invalid (e.g. the bunny's ear is covering the jello too much to place the apple).
  • ...and 5 more figures
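
The captions for Figures 1 and 2 describe turning the displacement between tracked points and their goal locations $g$ into a 4-DoF servoing action. The sketch below illustrates one plausible way to do that; it is not the paper's controller. It assumes the active points move approximately by a 2D similarity transform in the image, fits that transform in closed form by least squares, and reads off four error terms: image translation (x, y), log-scale (a proxy for motion along the optical axis), and in-plane rotation. The function name `servo_step`, the `gain` parameter, and the mapping of these errors to robot axes are all hypothetical; a real system would still need a camera-to-robot calibration or image Jacobian.

```python
import numpy as np


def servo_step(current, goal, visible=None, gain=0.5):
    """Hypothetical 4-DoF servo command from tracked points.

    current, goal: (N, 2) arrays of image coordinates for the same N points.
    visible: optional length-N boolean mask (e.g. tracker visibility).
    Returns gain * [dx, dy, log(scale), dtheta] in image space.
    """
    current = np.asarray(current, dtype=float)
    goal = np.asarray(goal, dtype=float)
    if visible is None:
        visible = np.ones(len(current), dtype=bool)
    visible = np.asarray(visible, dtype=bool)

    p = current[visible]
    g = goal[visible]
    if len(p) < 2:
        # Too few points to estimate scale and rotation; do nothing.
        return np.zeros(4)

    # Translation error: difference between the two centroids.
    p_mean = p.mean(axis=0)
    g_mean = g.mean(axis=0)
    pc = p - p_mean
    gc = g - g_mean

    denom = (pc ** 2).sum()
    if denom < 1e-12:
        # All points coincide; only translation is observable.
        return gain * np.array([*(g_mean - p_mean), 0.0, 0.0])

    # Closed-form least-squares similarity fit gc ~ s * R(theta) * pc,
    # with a = s*cos(theta) and b = s*sin(theta).
    a = (pc * gc).sum() / denom
    b = (pc[:, 0] * gc[:, 1] - pc[:, 1] * gc[:, 0]).sum() / denom
    scale = np.hypot(a, b)
    theta = np.arctan2(b, a)
    log_scale = np.log(scale) if scale > 1e-12 else 0.0

    error = np.array([g_mean[0] - p_mean[0], g_mean[1] - p_mean[1],
                      log_scale, theta])
    return gain * error


# Usage: identical point sets produce a zero command.
square = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
print(servo_step(square, square))
```

Applying a proportional gain to the fitted error and re-estimating it on every frame is the usual closed-loop servoing pattern: tracking noise on individual points is averaged out by the least-squares fit, and occluded points can be dropped through the `visible` mask.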