DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models

Norman Di Palo, Edward Johns

TL;DR

DINOBot introduces a retrieval-plus-alignment imitation-learning framework that leverages DINO-ViT features to generalise robot manipulation to novel objects from minimal demonstrations. By separating semantic retrieval (image-level) from geometric alignment (pixel-level) and replaying demonstrated trajectories, it achieves better data efficiency and generalisation than existing baselines. Extensive real-world experiments across tabletop and kitchen tasks show strong one-shot performance and resilience to distractors, highlighting the practical value of combining image-level and pixel-level visual reasoning in robotics. The approach emphasises explicit, modular reasoning over end-to-end policies, enabling scalable adaptation to new objects and tasks.

Abstract

We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
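The retrieval step described above can be illustrated with a minimal sketch. It assumes each image has already been reduced to a single image-level descriptor vector and that similarity is measured with cosine similarity; the summary does not state the metric, so this is an assumption. The function name `retrieve_most_similar`, the descriptor size of 384, and the random vectors standing in for DINO-ViT features are all hypothetical.

```python
import numpy as np


def retrieve_most_similar(live_embedding, buffer_embeddings):
    """Return the index of the buffer entry closest to the live embedding.

    Assumption: similarity is cosine similarity between image-level
    descriptors. The metric is not specified in the text above.
    """
    live = live_embedding / np.linalg.norm(live_embedding)
    buf = buffer_embeddings / np.linalg.norm(
        buffer_embeddings, axis=1, keepdims=True
    )
    scores = buf @ live  # one cosine similarity per demonstration image
    return int(np.argmax(scores)), scores


# Toy stand-ins for image-level descriptors (hypothetical data, not real features).
rng = np.random.default_rng(0)
buffer_embeddings = rng.normal(size=(5, 384))  # five demonstrated objects
# A "novel" object that looks like demonstration 3, plus a small perturbation.
live_embedding = buffer_embeddings[3] + 0.05 * rng.normal(size=384)

best_index, scores = retrieve_most_similar(live_embedding, buffer_embeddings)
print("retrieved demonstration:", best_index)
```

In a full pipeline, the retrieved index would select both the stored demonstration image and the trajectory recorded for that object.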

Paper Structure (7 sections, 10 figures, 1 table)

Figures (10)

  • Figure 1: From a single demo, DINOBot can learn to adapt to new objects, be robust to distractors, execute multi-stage, long-horizon tasks, and interact with complex environments.
  • Figure 2: Overall illustration of our framework. Upon observing a new object, the robot visually compares it to other objects observed during demonstrations to find the most similar object (semantic, image-level reasoning), and retrieves both its image and the trajectory executed on that object. Then, the robot aligns its end-effector with this image (spatial, pixel-level reasoning), before executing that trajectory. These two phases of reasoning are both based on extracting and matching DINO-ViT features.
  • Figure 3: In each column, given a live image (top), DINOBot retrieves the most similar image from the buffer (bottom) and finds correspondences between the two.
  • Figure 4: Success rates on each object for all methods.
  • Figure 5: Results on kitchen tasks.
  • ...and 5 more figures
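
The correspondence and alignment steps mentioned in the captions of Figures 2 and 3 can be sketched as follows. The captions do not say how correspondences are found or how they are turned into a motion, so both choices here are assumptions: matches are taken as mutual nearest neighbours between per-keypoint descriptors, and a rigid transform is fitted to the matched points with the standard SVD-based (Kabsch) least-squares solution. All function names and the synthetic 2D data are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbours(desc_a, desc_b):
    """Return index pairs (i, j) where a[i] and b[j] pick each other as best match."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                      # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)         # best b for every a
    nn_ba = sim.argmax(axis=0)         # best a for every b
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a       # keep only mutual matches
    return idx_a[keep], nn_ab[keep]


def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0] * (src.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic example: the "live" keypoints are the demo keypoints rotated by
# 30 degrees, shifted, and shuffled. Descriptors are random stand-ins.
rng = np.random.default_rng(0)
angle = np.deg2rad(30.0)
true_R = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle), np.cos(angle)]])
true_t = np.array([0.10, -0.05])

demo_points = rng.uniform(-1.0, 1.0, size=(20, 2))
demo_desc = rng.normal(size=(20, 64))
perm = rng.permutation(20)
live_points = (demo_points @ true_R.T + true_t)[perm]
live_desc = demo_desc[perm] + 0.01 * rng.normal(size=(20, 64))

idx_demo, idx_live = mutual_nearest_neighbours(demo_desc, live_desc)
R, t = estimate_rigid_transform(demo_points[idx_demo], live_points[idx_live])
print("recovered rotation (degrees):", np.rad2deg(np.arctan2(R[1, 0], R[0, 0])))
```

The mutual-match filter discards one-sided matches, which is a common way to reduce the influence of distractors before fitting the transform.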