Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination
Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves
TL;DR
The paper tackles the brittleness of traditional world models in robotics by introducing an explicit, compositional, object-centric framework. DreMa builds a learnable digital twin $\mathcal{M} = (\mathcal{O}_t, \mathcal{A}_t, D, \mathcal{T})$ using object-centric Gaussian Splats and a physics engine to render imagined futures and predict action consequences. It then leverages equivariant transformations to generate high-quality, novel demonstrations from a small set of originals, enabling one-shot and few-shot imitation learning that generalizes across task variations and object configurations. Empirical results in simulation and on a real Franka Panda robot show improved accuracy, robustness, and data efficiency, including a notable one-shot learning demonstration, highlighting the practical impact of combining explicit world models with learnable digital twins for robotic manipulation.
Abstract
A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hallucinations that make them unsuitable for real-world robotics applications. To overcome these challenges, we propose to rethink robot world models as learnable digital twins. We introduce DreMa, a new approach for constructing digital twins automatically using learned explicit representations of the real world and its dynamics, bridging the gap between traditional digital twins and world models. DreMa replicates the observed world and its structure by integrating Gaussian Splatting and physics simulators. Thanks to its compositionality, it allows robots to imagine novel configurations of objects and to predict the future consequences of robot actions. We leverage this capability to generate new data for imitation learning by applying equivariant transformations to a small set of demonstrations. Our evaluations across various settings demonstrate significant improvements in accuracy and robustness obtained by expanding the action and object distributions, which reduces the data needed to learn a policy and improves the generalization of the agents. As a highlight, we show that a real Franka Emika Panda robot, powered by DreMa's imagination, can successfully learn novel physical tasks from just a single example per task variation (one-shot policy learning). Our project page can be found at: https://dreamtomanipulate.github.io/.
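To make the idea of equivariant data generation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a demonstration can be reduced to one object pose plus a sequence of gripper waypoints, all given as 4x4 homogeneous matrices in the world frame, and that the transformation is a planar rotation about the vertical axis combined with a translation. The function names `planar_transform` and `augment_demo` and all numeric values are hypothetical. The abstract does not state which transformations DreMa uses or how imagined demonstrations are validated, so the sketch only illustrates the underlying property: applying the same rigid transform to the object and to the gripper trajectory leaves their relative pose unchanged.

```python
import numpy as np


def planar_transform(angle_rad, translation_xy):
    """Build a 4x4 homogeneous transform: rotation about z plus an xy shift."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    g = np.eye(4)
    g[:2, :2] = [[c, -s], [s, c]]
    g[:2, 3] = translation_xy
    return g


def augment_demo(object_pose, gripper_poses, g):
    """Apply the same rigid transform g to the object pose and to every waypoint.

    object_pose:   (4, 4) world-frame pose of the manipulated object.
    gripper_poses: (N, 4, 4) world-frame gripper waypoints.
    """
    new_object_pose = g @ object_pose
    new_gripper_poses = g @ gripper_poses  # matmul broadcasts over the N waypoints
    return new_object_pose, new_gripper_poses


# Toy demonstration (hypothetical values): approach from above, then grasp.
object_pose = np.eye(4)
object_pose[:3, 3] = [0.5, 0.0, 0.02]

approach = np.eye(4)
approach[:3, 3] = [0.5, 0.0, 0.20]
grasp = np.eye(4)
grasp[:3, 3] = [0.5, 0.0, 0.02]
gripper_poses = np.stack([approach, grasp])

# Imagine the same scene rotated by 90 degrees and shifted 10 cm along x.
g = planar_transform(np.pi / 2, [0.1, 0.0])
new_object_pose, new_gripper_poses = augment_demo(object_pose, gripper_poses, g)

# Equivariance check: the grasp pose relative to the object is preserved.
rel_before = np.linalg.inv(object_pose) @ gripper_poses[1]
rel_after = np.linalg.inv(new_object_pose) @ new_gripper_poses[1]
print(np.allclose(rel_before, rel_after))
```

In a full pipeline, each transformed demonstration would presumably also be replayed in the physics simulator of the digital twin before being added to the training set; that step is outside the scope of this sketch.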
