Disentangled Object-Centric Image Representation for Robotic Manipulation
David Emukpere, Romain Deffayet, Bingbing Wu, Romain Brégier, Michael Niemaz, Jean-Luc Meunier, Denys Proux, Jean-Michel Renders, Seungsu Kim
TL;DR
This work tackles the generalization gap in vision-driven robotic manipulation by introducing DOCIR, a disentangled object-centric representation that separately encodes the robot embodiment, target objects, and obstacles across two camera views. Segmentation-derived masks are applied to each view to produce four-channel inputs, a shared CNN encoder maps each input to a per-group encoding, and the encodings are fused into a two-view scene representation. PPO-based policies for multi-object pick-and-place are trained on this representation within a Markov Decision Process $\boldsymbol{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$ with actions $a \in \mathbb{R}^4$ split into $a_{\text{arm}}$ and $a_{\text{gripper}}$. Empirical results in simulation and in real-world setups show DOCIR achieving state-of-the-art performance, robust generalization to unseen objects and distractors, and strong sim-to-real transfer; it outperforms flat representations and object-centric representation (OCR) baselines, especially as scene complexity grows. The method highlights the value of structured, semantic disentanglement for robust skill learning in cluttered environments and points to future opportunities in open-world segmentation, integration with pre-trained models, imitation learning, and high-level policy composition for long-horizon tasks.
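To make the data flow described above concrete, the following is a minimal NumPy sketch of the masking, per-group encoding, and two-view fusion steps. It is an illustration under stated assumptions, not the authors' implementation: the fourth input channel is assumed to be the binary mask itself, fusion is assumed to be plain concatenation, the action is assumed to split into three arm dimensions and one gripper dimension, and `toy_encoder` is a hypothetical stand-in for the shared CNN. All function names are invented for this example.

```python
import numpy as np

# The three semantic groups named in the summary above.
GROUPS = ("robot", "target", "obstacles")


def masked_four_channel(rgb, mask):
    """Build one four-channel input of shape (H, W, 4) for a single group.

    Assumption: channels 0-2 are the RGB image with pixels outside the
    mask zeroed, and channel 3 is the binary mask itself.
    """
    m = mask.astype(rgb.dtype)[..., None]           # (H, W, 1)
    return np.concatenate([rgb * m, m], axis=-1)    # (H, W, 4)


def toy_encoder(x, weights):
    """Hypothetical stand-in for the shared CNN encoder.

    Global-average-pools the four channels and applies one linear map,
    so the same `weights` are reused for every group and every view.
    """
    pooled = x.mean(axis=(0, 1))                    # (4,)
    return np.tanh(weights @ pooled)                # (D,)


def scene_representation(views, weights):
    """Encode every (view, group) pair and fuse by concatenation.

    `views` is a list of (rgb, masks) pairs, one per camera, where
    `masks` maps each group name to a boolean (H, W) array.
    Assumption: fusion is simple concatenation of the encodings.
    """
    codes = []
    for rgb, masks in views:
        for group in GROUPS:
            x = masked_four_channel(rgb, masks[group])
            codes.append(toy_encoder(x, weights))
    return np.concatenate(codes)                    # (num_views * 3 * D,)


def split_action(a):
    """Split a 4-D action into (a_arm, a_gripper).

    Assumption: the first three entries drive the arm and the last
    entry drives the gripper.
    """
    a = np.asarray(a, dtype=float)
    return a[:3], a[3:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((8, 4))           # D = 8 features per encoding

    def random_view():
        rgb = rng.random((32, 32, 3))
        masks = {g: rng.random((32, 32)) > 0.5 for g in GROUPS}
        return rgb, masks

    z = scene_representation([random_view(), random_view()], weights)
    print(z.shape)                                  # 2 views x 3 groups x 8 features
```

Keeping one encoding per semantic group, rather than one per detected object, means the size of the fused vector depends only on the number of views and groups, so it stays fixed as the number of obstacles in the scene changes.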
Abstract
Learning robotic manipulation skills from vision is a promising approach for developing robotics applications that can generalize broadly to real-world scenarios. As such, many approaches to enable this vision have been explored with fruitful results. In particular, object-centric representation methods have been shown to provide better inductive biases for skill learning, leading to improved performance and generalization. Nonetheless, we show that object-centric methods can struggle to learn simple manipulation skills in multi-object environments. Thus, we propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick-and-place skills from visual inputs in multi-object environments and generalizes at test time to changing objects of interest and distractors in the scene. Furthermore, we show its efficacy both in simulation and in zero-shot transfer to the real world.
