DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
Norman Di Palo, Edward Johns
TL;DR
DINOBot introduces a retrieval-plus-alignment imitation-learning framework that leverages DINO-ViT features to generalise robot manipulation to novel objects from minimal demonstrations. By separating semantic retrieval (image-level) from geometric alignment (pixel-level) and replaying demonstrated trajectories, it achieves superior data efficiency and generalisation compared to existing baselines. Extensive real-world experiments across tabletop and kitchen tasks show strong one-shot performance and resilience to distractors, highlighting the practical value of combining image-level and pixel-level visual reasoning in robotics. The approach emphasises explicit, modular reasoning over end-to-end policies, enabling scalable adaptation to new objects and tasks.
Abstract
We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
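
The two stages described above, image-level retrieval followed by pixel-level alignment, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the correspondence rule, or the alignment solver, so everything below is an assumption made for illustration: cosine similarity for retrieval, mutual nearest neighbours for patch correspondences, and a least-squares rigid fit (the Kabsch algorithm) for alignment. The function names are hypothetical, and plain NumPy arrays stand in for DINO-ViT features and 3D keypoints.

```python
import numpy as np


def retrieve_most_similar(query_feat, demo_feats):
    """Image-level retrieval (assumed: cosine similarity).

    query_feat: (D,) global feature of the live image.
    demo_feats: (N, D) global features of the demonstration images.
    Returns the index of the best-matching demonstration and all similarities.
    """
    q = query_feat / np.linalg.norm(query_feat)
    d = demo_feats / np.linalg.norm(demo_feats, axis=1, keepdims=True)
    sims = d @ q
    return int(np.argmax(sims)), sims


def mutual_nearest_neighbours(desc_a, desc_b):
    """Pixel-level correspondences (assumed: mutual nearest neighbours).

    desc_a: (Na, D) patch descriptors of image A.
    desc_b: (Nb, D) patch descriptors of image B.
    Returns index arrays (ia, ib) such that patch ia[k] in A and patch ib[k]
    in B are each other's most similar patch.
    """
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T
    best_b_for_a = sim.argmax(axis=1)
    best_a_for_b = sim.argmax(axis=0)
    idx_a = np.arange(a.shape[0])
    mutual = best_a_for_b[best_b_for_a] == idx_a
    return idx_a[mutual], best_b_for_a[mutual]


def rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): find R, t with dst ~= R @ src + t.

    src, dst: (K, 3) corresponding 3D points.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t
```

In such a pipeline, the retrieved index would select the stored demonstration, the mutual matches between the live image and that demonstration's image would be lifted to 3D (for example with a depth camera, also an assumption here), and the resulting `R, t` would indicate how to move the end-effector before replaying the demonstrated trajectory.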
