Entity-Centric Reinforcement Learning for Object Manipulation from Pixels
Dan Haramati, Tal Daniel, Aviv Tamar
TL;DR
This work tackles pixel-based, goal-conditioned manipulation of multiple objects by combining an unsupervised object-centric representation (OCR), DLP, with an Entity Interaction Transformer (EIT) that models inter-object relations and multi-view cues. A novel Chamfer-based reward (GDAC) enables learning entirely from images, while a theoretical analysis shows that self-attention Q-functions can achieve compositional generalization, supporting zero-shot transfer to more objects than seen during training. Empirically, agents trained with up to 3 objects manipulate them accurately and generalize to tasks with 6 to more than 10 objects, and the approach outperforms baselines on interaction-heavy tasks where object relations matter. Pairing the OCR for structured perception with the EIT for relational reasoning offers a scalable path toward robust, pixel-based multi-object manipulation with strong generalization and sample efficiency.
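To make the idea of a Chamfer-based reward concrete, the following is a minimal sketch of a plain symmetric Chamfer distance between two unordered sets of entity vectors, negated to serve as a reward. It is not the GDAC variant used in this work; the function names and the use of 2D points as stand-ins for entity latents are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chamfer_distance(set_a, set_b):
    """Plain symmetric Chamfer distance between two non-empty sets of vectors.

    Each element is matched to its nearest neighbor in the other set, so no
    fixed ordering or one-to-one correspondence between entities is needed.
    """
    if not set_a or not set_b:
        raise ValueError("both sets must be non-empty")
    a_to_b = sum(min(euclidean(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(euclidean(a, b) for a in set_a) for b in set_b) / len(set_b)
    return a_to_b + b_to_a


def chamfer_reward(current_entities, goal_entities):
    """Hypothetical reward: negative Chamfer distance to the goal entities."""
    return -chamfer_distance(current_entities, goal_entities)


if __name__ == "__main__":
    # Hypothetical 2D stand-ins for entity latents extracted from images.
    goal = [(0.0, 0.0), (1.0, 0.0)]
    current = [(0.0, 1.0), (1.0, 0.0)]
    print(chamfer_reward(current, goal))
```

Because nearest-neighbor matching ignores ordering, the reward does not change when the entities are listed in a different order, which suits representations that return objects as an unordered set.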
Abstract
Manipulating objects is a hallmark of human intelligence, and an important task in domains such as robotics. In principle, Reinforcement Learning (RL) offers a general approach to learn object manipulation. In practice, however, domains with more than a few objects are difficult for RL agents due to the curse of dimensionality, especially when learning from raw image observations. In this work we propose a structured approach for visual RL that is suitable for representing multiple objects and their interaction, and use it to learn goal-conditioned manipulation of several objects. Key to our method is the ability to handle goals with dependencies between the objects (e.g., moving objects in a certain order). We further relate our architecture to the generalization capability of the trained agent, based on a theoretical result for compositional generalization, and demonstrate agents that learn with 3 objects but generalize to similar tasks with over 10 objects. Videos and code are available on the project website: https://sites.google.com/view/entity-centric-rl
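One structural reason an attention-based value function can be applied to more objects than it was trained on is that attention pooling is defined for any set size and does not depend on entity order. The sketch below illustrates only that property with a single attention-pooling step; it is not the architecture proposed in this work, and the names `attention_pool`, `q_value`, `query`, and `readout`, as well as the fixed weights, are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def attention_pool(query, entities):
    """Softmax-attention-weighted average of a non-empty set of entity vectors."""
    scale = math.sqrt(len(query))
    scores = [dot(query, e) / scale for e in entities]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(entities[0])
    return [
        sum(w * e[i] for w, e in zip(weights, entities)) / total
        for i in range(dim)
    ]


def q_value(query, readout, entities):
    """Hypothetical scalar value: linear readout of the pooled entity summary."""
    return dot(readout, attention_pool(query, entities))


if __name__ == "__main__":
    query = [0.5, -0.25]   # assumed fixed here; learned in a real model
    readout = [1.0, 2.0]   # assumed fixed here; learned in a real model
    three = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
    ten = three + [[0.1 * k, 0.2] for k in range(7)]
    # The same function accepts 3 or 10 entities without any change.
    print(q_value(query, readout, three))
    print(q_value(query, readout, ten))
```

The same parameters are reused for every entity, so adding objects changes only the size of the input set, not the shape of the function.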
