Composing Pre-Trained Object-Centric Representations for Robotics From "What" and "Where" Foundation Models

Junyao Shi; Jianing Qian; Yecheng Jason Ma; Dinesh Jayaraman

TL;DR

POCR tackles the challenge of learning robust robotic manipulation from general visual data by composing pre-trained "what" and "where" foundation models into a unified object-centric representation. The framework uses a segmentation-based "where" component (e.g., SAM) to produce stable object masks and binds them to a fixed slot structure, while a pre-trained "what" encoder provides per-slot feature vectors for control. A self-attentive policy operates over the slot representations to imitate expert actions, enabling off-the-shelf, instruction-free learning that generalizes to unseen objects and scenes. Across RLBench simulations and real-world kitchen setups, POCR with SAM for localization and LIV for visual content consistently outperforms prior representations and demonstrates significant systematic generalization, supporting the practicality of plug-and-play object-centric representations in robotics.
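The composition described above, one feature vector per slot built from mask-restricted image content plus the mask's location, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `toy_what_encoder` is a hypothetical stand-in for a pre-trained "what" encoder such as LIV, the masks are hand-made rather than produced by SAM, and concatenating normalized bounding-box coordinates is one plausible way to attach "where" information, not necessarily the authors' exact layout.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return the normalized (x_min, y_min, x_max, y_max) box of a boolean mask."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    if ys.size == 0:  # empty slot: nothing is bound to it
        return np.zeros(4, dtype=np.float32)
    return np.array(
        [xs.min() / w, ys.min() / h, (xs.max() + 1) / w, (ys.max() + 1) / h],
        dtype=np.float32,
    )

def toy_what_encoder(image):
    """Hypothetical stand-in for a pre-trained 'what' encoder: mean color per channel."""
    return image.reshape(-1, image.shape[-1]).mean(axis=0).astype(np.float32)

def build_slots(image, masks, encoder=toy_what_encoder):
    """Build one vector per slot: [what features, where box]."""
    slots = []
    for mask in masks:
        masked = image * mask[..., None]  # zero out pixels outside this slot's mask
        what = encoder(masked)            # content descriptor for the slot
        where = mask_to_bbox(mask)        # location descriptor for the slot
        slots.append(np.concatenate([what, where]))
    return np.stack(slots)                # shape: (num_slots, feat_dim + 4)

# Toy usage: a 4x4 white image with two hand-made masks.
image = np.ones((4, 4, 3), dtype=np.float32)
top_left = np.zeros((4, 4), dtype=bool)
top_left[:2, :2] = True
full = np.ones((4, 4), dtype=bool)
slots = build_slots(image, [top_left, full])
print(slots.shape)
```

In a real pipeline the masks would come from the "where" model and the encoder would be a frozen pre-trained network, so only the downstream policy needs to be learned.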

Abstract

There have recently been large advances both in pre-training visual representations for robotic control and in segmenting objects of unknown categories in general images. To leverage these for improved robot learning, we propose POCR, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate the various entities in the scene across timesteps, capturing "where" information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, capturing "what" the entity is. Our pre-trained object-centric representations for control are thus constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.
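To make the idea of an imitation policy over object-centric slots concrete, here is a minimal sketch of a single-head self-attention layer over slot vectors followed by an action head, with a mean-squared-error behavior cloning loss. The class name `SelfAttentivePolicy`, the dimensions, the mean pooling, and the MSE loss are all assumptions chosen for illustration; the weights are random, and this is not the authors' architecture or training objective.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttentivePolicy:
    """Hypothetical policy: self-attention over slots, mean-pool, linear action head."""

    def __init__(self, slot_dim, hidden_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(slot_dim)
        self.w_q = rng.normal(0.0, s, (slot_dim, hidden_dim))
        self.w_k = rng.normal(0.0, s, (slot_dim, hidden_dim))
        self.w_v = rng.normal(0.0, s, (slot_dim, hidden_dim))
        self.w_out = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, action_dim))

    def __call__(self, slots):
        # slots: (num_slots, slot_dim)
        q, k, v = slots @ self.w_q, slots @ self.w_k, slots @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_slots, num_slots)
        mixed = attn @ v                                # each slot attends to all slots
        pooled = mixed.mean(axis=0)                     # aggregate slots into one vector
        return pooled @ self.w_out                      # predicted action

def bc_loss(predicted_action, expert_action):
    """Behavior cloning loss: mean squared error against the expert action."""
    return float(np.mean((predicted_action - expert_action) ** 2))

# Toy usage with random slot vectors and a random "expert" action.
rng = np.random.default_rng(1)
slots = rng.normal(size=(5, 7))
policy = SelfAttentivePolicy(slot_dim=7, hidden_dim=16, action_dim=4)
action = policy(slots)
print(action.shape, bc_loss(action, rng.normal(size=4)))
```

Because this sketch uses no positional encoding and pools by averaging, its output does not depend on slot order. That is a property of the sketch, not a claim about POCR.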

Paper Structure (31 sections, 1 equation, 20 figures, 8 tables)

Introduction
Problem Setup and Background
Pre-Trained Object Centric Representations For Robotic Manipulation
The Where: Localizing and Assigning Objects to Slots
The What: Representing The Image Contents in Each Slot
The How: Learning Robot Manipulation Policies from Demonstrations with POCR
Other Related Work
Experimental Results
Simulation Experiments Setup
Investigating "Where" Representations for POCR
Investigating "What" Representations for POCR
How Does POCR Compare to SOTA Representations?
Real-World Experiments
Systematic Generalization Experiments
Limitation and Future Works
...and 16 more sections

Figures (20)

Figure 1: POCR: Pre-Trained Object-Centric Representations for Robotics by chaining "what" and "where" foundation models. The "where" foundation model produces a set of masks representing objects in the scene. Slot binding selects which among them to bind to the slots in our OCE. Image contents in each slot are represented by the "what" foundation model and their mask bounding box coordinates. The robot learns policies over slot representations.
Figure 2: POCR segmentation results over demonstrations.
Figure 3: Evaluation Environments.
Figure 4: Real-World Policy Rollouts.
Figure 5: Systematic Generalization Evaluation Environment. Real-world new distractor: a green pear is the new distractor fruit. Real-world new background: a blue cloth serves as the new background. Simulation new distractors: a carrot and a banana are the new distractors.
...and 15 more figures
