Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination

Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves

TL;DR

The paper tackles the brittleness of traditional world models in robotics by introducing an explicit, compositional, object-centric framework. DreMa builds a learnable digital twin $\mathcal{M} = (\mathcal{O}_t, \mathcal{A}_t, D, \mathcal{T})$ using object-centric Gaussian Splats and a physics engine to render imagined futures and predict action consequences. It then leverages equivariant transformations to generate high-quality, novel demonstrations from a small set of originals, enabling one-shot and few-shot imitation learning that generalizes across task variations and object configurations. Empirical results in simulation and on a real Franka Panda robot show improved accuracy, robustness, and data efficiency, including a notable one-shot learning demonstration, highlighting the practical impact of combining explicit world models with learnable digital twins for robotic manipulation.

Abstract

A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hallucinations that make them unsuitable for real-world robotics applications. To overcome these challenges, we propose to rethink robot world models as learnable digital twins. We introduce DreMa, a new approach for constructing digital twins automatically using learned explicit representations of the real world and its dynamics, bridging the gap between traditional digital twins and world models. DreMa replicates the observed world and its structure by integrating Gaussian Splatting and physics simulators. Thanks to its compositionality, it allows robots to imagine novel configurations of objects and to predict the future consequences of robot actions. We leverage this capability to generate new data for imitation learning by applying equivariant transformations to a small set of demonstrations. Our evaluations across various settings demonstrate significant improvements in accuracy and robustness from expanding the action and object distributions, which reduces the data needed to learn a policy and improves the generalization of the agents. As a highlight, we show that a real Franka Emika Panda robot, powered by DreMa's imagination, can successfully learn novel physical tasks from just a single example per task variation (one-shot policy learning). Our project page can be found at: https://dreamtomanipulate.github.io/.
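To make the idea of generating demonstrations through equivariant transformations more concrete, here is a minimal geometric sketch. It is not DreMa's implementation: it assumes that a demonstration can be summarized as a set of object poses plus a list of gripper waypoints, all given as 4x4 homogeneous matrices in a common world frame, and it applies one rigid transform to all of them. The function names (`planar_transform`, `transform_demo`) and the example poses are hypothetical. The Gaussian Splatting rendering and the physics-based prediction described in the abstract are left out entirely.

```python
import numpy as np

def planar_transform(yaw_rad, translation_xy):
    """Build a 4x4 homogeneous transform: rotation about the z axis, then an xy shift."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    g = np.eye(4)
    g[:2, :2] = [[c, -s], [s, c]]
    g[:2, 3] = translation_xy
    return g

def transform_demo(object_poses, gripper_waypoints, g):
    """Apply the same rigid transform g to every object pose and gripper waypoint.

    Because objects and gripper move together, the pose of the gripper
    relative to each object is unchanged (hypothetical helper, for illustration).
    """
    new_objects = {name: g @ pose for name, pose in object_poses.items()}
    new_waypoints = [g @ w for w in gripper_waypoints]
    return new_objects, new_waypoints

# Hypothetical original demonstration: a cube and one grasp waypoint 10 cm above it.
cube = np.eye(4)
cube[:3, 3] = [0.4, 0.0, 0.02]
grasp = np.eye(4)
grasp[:3, 3] = [0.4, 0.0, 0.12]

# Rotate the scene by 90 degrees about z and shift it 10 cm along y.
g = planar_transform(np.pi / 2, [0.0, 0.1])
new_objects, new_waypoints = transform_demo({"cube": cube}, [grasp], g)

# The gripper-to-object relation is identical before and after the transform.
rel_before = np.linalg.inv(cube) @ grasp
rel_after = np.linalg.inv(new_objects["cube"]) @ new_waypoints[0]
print(np.allclose(rel_before, rel_after))
```

The sketch relies on the property that transforming the scene and the action by the same rigid motion preserves their relative geometry. In a full pipeline, a transformed demonstration would presumably still need to be checked, for example by replaying it in the simulated twin, before being used for training.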

Paper Structure

This paper contains 83 sections, 1 equation, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Overview of imagination with DreMa, which builds a compositional manipulation world model from environment images using object-centric Gaussian Splatting to generate novel demonstrations by transforming real ones.
  • Figure 2: Steps to create the compositional world model with DreMa: observation of the environment and scene decomposition, representation extraction and future predictions.
  • Figure 3: The effect of equivariant translation, equivariant rotation, and the object rotation transformations. Top row: start of demonstration. Bottom row: target of demonstration.
  • Figure 4: Imagined demonstrations keep improving imitation learning even with increasing number of original data.
  • Figure 5: Original (top) and imagined demonstration (bottom) after a $90^\circ$ rotation transformation.
  • ...and 15 more figures