Composing Pre-Trained Object-Centric Representations for Robotics From "What" and "Where" Foundation Models

Junyao Shi, Jianing Qian, Yecheng Jason Ma, Dinesh Jayaraman

TL;DR

POCR tackles the challenge of learning robust robotic manipulation from general visual data by composing pre-trained "what" and "where" foundation models into a unified object-centric representation. The framework uses a segmentation-based "where" component (e.g., SAM) to produce stable object masks and binds them to a fixed set of slots, while a pre-trained "what" encoder provides a feature vector for each slot. A self-attentive policy operates over the slot representations to imitate expert actions, enabling off-the-shelf, instruction-free learning that generalizes to unseen objects and scenes. Across RLBench simulations and real-world kitchen setups, POCR with SAM for localization and LIV for visual content consistently outperforms prior representations and shows systematic generalization, supporting the practicality of plug-and-play object-centric representations in robotics.
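
To make the idea of a self-attentive policy over slots concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the single attention head, the mean pooling over slots, the linear action head, the mean-squared-error imitation loss, and all names (`init_params`, `self_attention_policy`, `bc_loss`) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(slot_dim, hidden_dim, action_dim, seed=0):
    # Hypothetical parameter layout: one attention head plus a linear action head.
    rng = np.random.default_rng(seed)
    shapes = {
        "Wq": (slot_dim, hidden_dim),
        "Wk": (slot_dim, hidden_dim),
        "Wv": (slot_dim, hidden_dim),
        "Wout": (hidden_dim, action_dim),
    }
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}


def self_attention_policy(slots, params):
    """Map slot representations of shape (K, D) to an action of shape (A,)."""
    q = slots @ params["Wq"]
    k = slots @ params["Wk"]
    v = slots @ params["Wv"]
    # Each slot attends to every slot, including itself: weights have shape (K, K).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    mixed = attn @ v
    # Mean pooling (an assumption) makes the output independent of slot order.
    pooled = mixed.mean(axis=0)
    return pooled @ params["Wout"]


def bc_loss(slots, expert_action, params):
    # Behaviour-cloning objective: squared error to the expert's action.
    pred = self_attention_policy(slots, params)
    return float(np.mean((pred - expert_action) ** 2))


# Usage: 4 slots with 8-dimensional representations, 7-dimensional action.
params = init_params(slot_dim=8, hidden_dim=16, action_dim=7)
slots = np.random.default_rng(1).normal(size=(4, 8))
action = self_attention_policy(slots, params)
```

Because the sketch uses no positional encoding and pools by averaging, reordering the slots leaves the predicted action unchanged. That is a property of this sketch, not a claim about the paper's policy.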

Abstract

There have recently been large advances both in pre-training visual representations for robotic control and in segmenting objects of unknown categories in general images. To leverage these for improved robot learning, we propose POCR, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate various entities in the scene across timesteps, capturing "where" information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, thus capturing "what" the entity is. Our pre-trained object-centric representations for control are therefore constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.
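
The "what" and "where" composition described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the masks are assumed to be given already (for example by a segmentation model), `toy_what_encoder` is a hypothetical stand-in for a pre-trained encoder, and feeding the encoder a masked copy of the image and appending normalized bounding-box coordinates are assumptions about how the two sources of information are combined.

```python
import numpy as np


def mask_to_bbox(mask):
    """Normalized (x_min, y_min, x_max, y_max) of a boolean mask of shape (H, W)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        # Empty mask: return a degenerate box rather than failing.
        return np.zeros(4)
    h, w = mask.shape
    return np.array([xs.min() / w, ys.min() / h, (xs.max() + 1) / w, (ys.max() + 1) / h])


def toy_what_encoder(image, dim=8):
    # Hypothetical stand-in for a pre-trained "what" encoder:
    # a fixed random projection of the mean colour of the image.
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(image.shape[-1], dim))
    return image.reshape(-1, image.shape[-1]).mean(axis=0) @ proj


def build_slot_representations(image, masks, what_encoder):
    """Return one row per slot: encoder features followed by the mask's bounding box."""
    slots = []
    for mask in masks:
        # "Where": keep only the pixels that belong to this entity.
        masked = image * mask[..., None]
        # "What": describe the entity with the frozen encoder.
        feat = what_encoder(masked)
        slots.append(np.concatenate([feat, mask_to_bbox(mask)]))
    return np.stack(slots)


# Usage: a 4x4 RGB image with two hand-made masks standing in for segmentations.
image = np.ones((4, 4, 3))
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[1:3, 0:2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[0:1, 2:4] = True
slots = build_slot_representations(image, [mask_a, mask_b], toy_what_encoder)
```

Nothing in this sketch is trained: swapping in a different encoder or a different source of masks changes the inputs to `build_slot_representations` only, which is the plug-and-play property the abstract emphasizes.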

Paper Structure

This paper contains 31 sections, 1 equation, 20 figures, 8 tables.

Figures (20)

  • Figure 1: POCR: Pre-Trained Object-Centric Representations for Robotics by chaining "what" and "where" foundation models. The "where" foundation model produces a set of masks representing objects in the scene. Slot binding selects which among them to bind to the slots in our OCE. Image contents in each slot are represented by the "what" foundation model and their mask bounding box coordinates. The robot learns policies over slot representations.
  • Figure 2: POCR segmentation results over demonstrations.
  • Figure 3: Evaluation Environments.
  • Figure 4: Real-World Policy Rollouts.
  • Figure 5: Systematic Generalization Evaluation Environment. Real-world, new distractor: a green pear is the new distractor fruit. Real-world, new background: a blue cloth serves as the new background. Simulation, new distractors: a carrot and a banana are the new distractors.
  • ...and 15 more figures