Disentangled Object-Centric Image Representation for Robotic Manipulation

David Emukpere, Romain Deffayet, Bingbing Wu, Romain Brégier, Michael Niemaz, Jean-Luc Meunier, Denys Proux, Jean-Michel Renders, Seungsu Kim

TL;DR

This work tackles the generalization gap in vision-driven robotic manipulation by introducing DOCIR, a disentangled object-centric representation that separately encodes robot embodiment, target objects, and obstacles across two camera views. By applying segmentation-derived masks to produce four-channel inputs and using a shared CNN encoder to generate per-group encodings that are fused into two-view scene representations, the approach trains PPO-based policies for multi-object pick-and-place within a Markov Decision Process $\boldsymbol{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$ with $a \in \mathbb{R}^4$, split into $a_{\text{arm}}$ and $a_{\text{gripper}}$. Empirical results in simulation and real-world setups show DOCIR achieving state-of-the-art performance, robust generalization to unseen objects and distractors, and strong sim-to-real transfer, outperforming flat representations and OCR baselines especially as scene complexity grows. The method highlights the value of structured, semantic disentanglement for robust skill learning in cluttered environments and points to future opportunities in open-world segmentation, integration with pre-trained models, imitation learning, and high-level policy composition for long-horizon tasks.
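
The masking and fusion steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the fourth channel is the binary group mask appended to the masked RGB image, that the segmentation arrives as an integer id map, and that fusion is plain concatenation. The names `disentangle_view`, `placeholder_encoder`, `encode_scene`, and the `group_ids` layout are hypothetical, and the mean-pooling encoder is only a stand-in for the shared CNN.

```python
import numpy as np

# The three semantic groups named in the text.
GROUPS = ("robot", "target", "obstacles")


def disentangle_view(rgb, seg, group_ids):
    """Split one camera image into per-group four-channel images.

    rgb: (H, W, 3) float array.
    seg: (H, W) integer segmentation map.
    group_ids: dict mapping group name -> list of segment ids (hypothetical layout).
    Returns an array of shape (3, H, W, 4): masked RGB plus the binary mask.
    Appending the mask as the fourth channel is an assumption of this sketch.
    """
    out = []
    for name in GROUPS:
        mask = np.isin(seg, group_ids[name]).astype(rgb.dtype)  # (H, W)
        masked_rgb = rgb * mask[..., None]                      # zero out other pixels
        out.append(np.concatenate([masked_rgb, mask[..., None]], axis=-1))
    return np.stack(out)


def placeholder_encoder(x):
    """Stand-in for the shared CNN: per-channel mean pooling, (H, W, 4) -> (4,)."""
    return x.mean(axis=(0, 1))


def encode_scene(rgbs, segs, group_ids):
    """Apply the same encoder to every group in every view, then concatenate."""
    feats = []
    for rgb, seg in zip(rgbs, segs):
        for group_img in disentangle_view(rgb, seg, group_ids):
            feats.append(placeholder_encoder(group_img))
    return np.concatenate(feats)
```

With two views (base and wrist), three groups, and this 4-dimensional placeholder feature, `encode_scene` returns a 24-dimensional vector. A learned encoder would replace `placeholder_encoder` while keeping the same per-group, shared-weights structure.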

Abstract

Learning robotic manipulation skills from vision is a promising approach for developing robotics applications that can generalize broadly to real-world scenarios. As such, many approaches to enable this vision have been explored with fruitful results. Particularly, object-centric representation methods have been shown to provide better inductive biases for skill learning, leading to improved performance and generalization. Nonetheless, we show that object-centric methods can struggle to learn simple manipulation skills in multi-object environments. Thus, we propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments and generalizes at test time to changing objects of interest and distractors in the scene. Furthermore, we show its efficacy both in simulation and zero-shot transfer to the real world.

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: DOCIR overview. We introduce an object-centric disentanglement scheme for multi-view visual robotic manipulation to improve learning performance, efficiency, and generalization of robotic manipulation skills.
  • Figure 2: Example DOCIR segmentation. Observations recovered by our disentanglement procedure in simulation and real-world environments. In each environment, the two rows show the base and wrist camera views, respectively. From left to right: the full image, followed by the masked images for the robot, object of interest, and obstacles.
  • Figure 3: Multi-object simulation environment. (Left-to-right): full scene view, base camera view, wrist camera view.
  • Figure 4: Fixed target skill learning curves. We report a rolling average of training performance, aggregated over $3$ seeded runs. In the fixed target environment, the object of interest is the same across all episodes.
  • Figure 5: Varying target skill learning curves. We report a rolling average of training performance, aggregated over $3$ seeded runs. In the varying target environment, the object of interest is different in every new episode.
  • ...and 3 more figures
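
The captions of Figures 4 and 5 describe learning curves as a rolling average aggregated over 3 seeded runs. One plausible way to compute such a curve is sketched below. The window size, the use of a trailing window, the choice of the mean as the aggregate, and the function names `rolling_average` and `aggregate_runs` are assumptions of this sketch and are not stated in the source.

```python
from statistics import mean


def rolling_average(values, window):
    """Trailing rolling mean; early points average over the samples seen so far."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out.append(mean(values[start:i + 1]))
    return out


def aggregate_runs(runs, window):
    """Average across seeded runs at each step, then smooth the result.

    runs: list of equal-length lists, one per seed (e.g. per-step success rate).
    """
    per_step = [mean(step_vals) for step_vals in zip(*runs)]
    return rolling_average(per_step, window)
```

For example, `aggregate_runs([run_seed0, run_seed1, run_seed2], window=50)` would yield one smoothed curve per method, where the three run lists and the window of 50 are placeholders.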