RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation
Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, Jon Scholz
TL;DR
RoboTAP tackles the challenge of quickly teaching robots new visuomotor skills with minimal data by using dense point tracking via TAPIR to extract actionable motion from demonstrations. It factorizes perception and control into "what", "where", and "how", building a motion plan from a few demonstrations and executing it with a robust 4-DoF visual-servoing controller. The paper introduces Online TAPIR for real-time operation, presents a new RoboTAP dataset of real-world robotic videos with ground-truth point tracks, and demonstrates complex manipulation tasks with millimeter-scale precision and strong robustness to clutter. While less general than fully end-to-end models, RoboTAP offers a data-efficient, interpretable framework that can be integrated with larger systems to enable scalable, real-world robotic manipulation.
Abstract
For robots to be useful outside labs and specialized factories, we need a way to teach them new useful behaviors quickly. Current approaches lack either the generality to onboard new tasks without task-specific engineering or the data-efficiency to do so in an amount of time that enables practical use. In this work, we explore dense tracking as a representational vehicle to allow faster and more general learning from demonstration. Our approach utilizes Track-Any-Point (TAP) models to isolate the relevant motion in a demonstration and to parameterize a low-level controller that reproduces this motion across changes in the scene configuration. We show that this results in robust robot policies that can solve complex object-arrangement tasks such as shape-matching and stacking, and even full path-following tasks such as applying glue and sticking objects together, all from demonstrations that can be collected in minutes.
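
To make the idea of driving a low-level controller from tracked points more concrete, the following is a minimal sketch of one proportional visual-servoing step. It is an illustration under stated assumptions, not the controller described in the paper: the function name `servo_step`, the `gain` parameter, and the choice of a 2D similarity fit (translation, scale, in-plane rotation) as a stand-in for a 4-DoF error are all hypothetical. The inputs are assumed to be matched pixel coordinates of the same points in the live image and in a demonstration frame, as a point tracker might provide.

```python
import math


def servo_step(current, goal, gain=0.5):
    """One proportional servoing step from matched 2D points (hypothetical sketch).

    current, goal: lists of (x, y) pixel coordinates for the same tracked
    points in the live image and in the demonstration (goal) frame.
    Returns (dx, dy, dlog_scale, dyaw): the gain-scaled image-space error.
    """
    if len(current) != len(goal) or len(current) < 2:
        raise ValueError("need at least two matched points")
    n = len(current)

    # Centroids of both point sets give the translation error.
    cx = sum(x for x, _ in current) / n
    cy = sum(y for _, y in current) / n
    gx = sum(x for x, _ in goal) / n
    gy = sum(y for _, y in goal) / n

    # Centered coordinates remove translation before fitting rotation and scale.
    cur_c = [(x - cx, y - cy) for x, y in current]
    goal_c = [(x - gx, y - gy) for x, y in goal]

    # In-plane rotation from the summed dot and cross products (2D Procrustes).
    dot = sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(cur_c, goal_c))
    cross = sum(a[0] * b[1] - a[1] * b[0] for a, b in zip(cur_c, goal_c))
    yaw = math.atan2(cross, dot)

    # Ratio of RMS spreads: an assumed proxy for motion along the camera axis.
    cur_spread = math.sqrt(sum(x * x + y * y for x, y in cur_c) / n)
    goal_spread = math.sqrt(sum(x * x + y * y for x, y in goal_c) / n)
    if cur_spread == 0.0 or goal_spread == 0.0:
        raise ValueError("points are degenerate (zero spread)")
    log_scale = math.log(goal_spread / cur_spread)

    return (gain * (gx - cx), gain * (gy - cy), gain * log_scale, gain * yaw)


if __name__ == "__main__":
    square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
    shifted = [(x + 4.0, y - 2.0) for x, y in square]
    print(servo_step(square, shifted))
```

In this sketch the returned tuple is an image-space error only; mapping it to actual robot velocities would require camera and robot calibration details that are not given here.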
