Entity-Centric Reinforcement Learning for Object Manipulation from Pixels

Dan Haramati, Tal Daniel, Aviv Tamar

TL;DR

This work tackles pixel-based, goal-conditioned manipulation of multiple objects by combining an unsupervised object-centric representation (OCR), Deep Latent Particles (DLP), with the Entity Interaction Transformer (EIT), a Transformer architecture that models inter-object relations and fuses cues from multiple views. A novel Chamfer-based reward (GDAC) enables end-to-end learning from images, and a theoretical analysis shows that Q-functions built on self-attention can achieve compositional generalization, which supports zero-shot transfer to more objects than were seen during training. Empirically, agents trained with up to 3 objects generalize to tasks with 6 to over 10 objects, and the method outperforms baselines on interaction-heavy tasks where object relations matter. Pairing an OCR for structured perception with the EIT for relational reasoning offers a scalable path toward robust, pixel-based multi-object manipulation with strong generalization and sample efficiency.
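To make the idea of a Chamfer-based reward concrete, here is a minimal sketch of the plain symmetric Chamfer distance between two sets of points, used as a negative reward. This is not the GDAC reward itself, whose exact form is not given in this summary; the function names `chamfer_distance` and `chamfer_reward` and the use of 2D positions as the compared features are assumptions made for illustration.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, D) and b (M, D)."""
    # Pairwise Euclidean distances via broadcasting, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Mean nearest-neighbour distance in each direction, summed.
    return float(dist.min(axis=1).mean() + dist.min(axis=0).mean())


def chamfer_reward(state_pts: np.ndarray, goal_pts: np.ndarray) -> float:
    """Hypothetical reward: closer point sets give a higher (less negative) value."""
    return -chamfer_distance(state_pts, goal_pts)


if __name__ == "__main__":
    state = np.array([[0.0, 0.0], [1.0, 0.0]])
    goal = state[::-1]  # same points, listed in a different order
    # The distance ignores ordering, so no entity matching is required.
    print(chamfer_distance(state, goal))
    print(chamfer_reward(np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]])))
```

Because each point is matched to its nearest neighbour in the other set, the measure does not depend on the order in which entities are listed, which is the property that makes a Chamfer-style comparison a natural fit for unordered particle sets.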

Abstract

Manipulating objects is a hallmark of human intelligence, and an important task in domains such as robotics. In principle, Reinforcement Learning (RL) offers a general approach to learn object manipulation. In practice, however, domains with more than a few objects are difficult for RL agents due to the curse of dimensionality, especially when learning from raw image observations. In this work we propose a structured approach for visual RL that is suitable for representing multiple objects and their interaction, and use it to learn goal-conditioned manipulation of several objects. Key to our method is the ability to handle goals with dependencies between the objects (e.g., moving objects in a certain order). We further relate our architecture to the generalization capability of the trained agent, based on a theoretical result for compositional generalization, and demonstrate agents that learn with 3 objects but generalize to similar tasks with over 10 objects. Videos and code are available on the project website: https://sites.google.com/view/entity-centric-rl

Paper Structure

This paper contains 42 sections, 7 theorems, 61 equations, 16 figures, and 11 tables.

Key Result

Theorem 2

Let Assumption ass:q-structure hold. Let $\hat{Q}$ be an approximation of $Q^{*}$ with the same structure. Assume that $\forall s\in\mathcal{S}^N,\ \forall a\in\mathcal{A},\ \forall N\in\left[1,M\right]$ we have $\left|\hat{Q}\left(s_{1},\ldots,s_{N},a\right)-Q^{*}\left(s_{1},\ldots,s_{N},a\right)\right|<\ldots$

Figures (16)

  • Figure 1: The environment we used for our experiments (left) and how the agent perceives it (middle, right). Colored keypoints mark the position attribute $z_p$ of particles from the DLP representation.
  • Figure 2: Outline of the Entity Interaction Transformer (EIT) - Sets of state and goal particles from multiple views with an additive view encoding are input to a sequence of Transformer blocks. For the Q-function, an action particle is added. We condition on goals with cross-attention. Attention-based aggregation reduces the set to a single vector, followed by an MLP that produces the final output.
  • Figure 3: The simulated environments used for experiments in this work.
  • Figure 4: Success Rate vs. Environment Timesteps -- Values are calculated on $96$ randomly sampled goals. Methods with input type 'State' are shown as dashed lines and learn from ground-truth (GT) state observations; all others learn from images. Our method performs better than or on par with the best-performing baseline in each category (state-based/image-based). In the environments requiring object interaction ((d), (f)), our method achieves significantly better performance than SMORL. Notably, our image-based method matches or surpasses state-based SMORL.
  • Figure 5: Left -- Rollout of an agent trained on the Push-2T task. Right -- Distribution of the object angle difference (radians) from the goal, computed over $400$ episodes with randomly sampled goal and initial configurations.
  • ...and 11 more figures
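
The attention-based aggregation step described in the Figure 2 caption reduces a set of particles to a single vector. The sketch below shows one simple way such a step can work: single-head attention pooling with one query vector. It is an illustration under stated assumptions, not the authors' EIT implementation; the name `attention_pool`, the random weights, and the dimensions are all hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pool(particles, query, w_k, w_v):
    """Pool a set of particles (N, D) into one vector of size H.

    query: (H,) pooling query; w_k, w_v: (D, H) key and value projections.
    """
    keys = particles @ w_k                             # (N, H)
    values = particles @ w_v                           # (N, H)
    scores = keys @ query / np.sqrt(query.shape[0])    # (N,)
    weights = softmax(scores)                          # attention over entities
    return weights @ values                            # (H,)


rng = np.random.default_rng(0)
D, H = 4, 8  # hypothetical particle and hidden dimensions
query = rng.normal(size=H)
w_k = rng.normal(size=(D, H))
w_v = rng.normal(size=(D, H))

three = rng.normal(size=(3, D))    # e.g. a scene with 3 entities
ten = rng.normal(size=(10, D))     # e.g. a scene with 10 entities

pooled_three = attention_pool(three, query, w_k, w_v)
pooled_shuffled = attention_pool(three[::-1], query, w_k, w_v)
pooled_ten = attention_pool(ten, query, w_k, w_v)
```

Two properties are visible here: the pooled vector does not change when the particles are reordered, and the same weights accept 3 or 10 particles without modification. Both are prerequisites for applying a network trained on few objects to scenes with more objects.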

Theorems & Definitions (14)

  • Theorem 2
  • Definition 3
  • Theorem 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • Lemma 7
  • proof
  • ...and 4 more