Object-Centric World Model for Language-Guided Manipulation

Youngjoon Jeong, Junha Chun, Soonwoo Cha, Taesup Kim

TL;DR

This work addresses the high computational cost of diffusion-based video generators in language-guided world modeling by introducing an object-centric world model that operates in a compact slot space. It combines SAVi-based slot extraction with a language-conditioned transformer (LSlotFormer) to predict future object states, and uses a transformer-based action decoder trained via behavioral cloning to perform manipulation tasks without goal images. The approach surpasses diffusion-based baselines in visuo-linguo-motor control on the LangTable dataset, with better sample efficiency and faster training and inference, and it generalizes to unseen blocks and tasks. The results highlight the practical value of language-guided, object-centric representations for efficient robotic perception and control.

Abstract

A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. To achieve this, recent advancements have focused on video generation, which has gained significant attention due to the impressive success of diffusion models. However, these models require substantial computational resources. To address these challenges, we propose a world model leveraging object-centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions. This approach results in a more compact and computationally efficient model compared to diffusion-based generative alternatives. Furthermore, it flexibly predicts future states based on language instructions, and offers a significant advantage in manipulation tasks where object recognition is crucial. In this paper, we demonstrate that our latent predictive world model surpasses generative world models in visuo-linguo-motor control tasks, achieving superior sample and computation efficiency. We also investigate the generalization performance of the proposed method and explore various strategies for predicting actions using object-centric representations.
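The latent prediction described above (and the autoregressive scheme in Figure 1(b)) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `rollout`, `slot_mse`, `toy_predictor`, and the `window` parameter are hypothetical names, the predictor is a toy stand-in for the learned world model, and mean squared error is only assumed as the form of the reconstruction loss.

```python
from typing import Callable, List, Sequence

Slots = List[List[float]]  # one frame: N slots, each a D-dimensional vector


def rollout(
    predict_next: Callable[[Sequence[Slots], Sequence[float]], Slots],
    context: Sequence[Slots],
    instruction: Sequence[float],
    horizon: int,
    window: int,
) -> List[Slots]:
    """Predict `horizon` future slot sets, feeding each prediction back in."""
    history = list(context)
    predictions: List[Slots] = []
    for _ in range(horizon):
        # The model only sees the most recent `window` frames of slots.
        next_slots = predict_next(history[-window:], instruction)
        predictions.append(next_slots)
        history.append(next_slots)  # autoregressive step
    return predictions


def slot_mse(pred: Sequence[Slots], target: Sequence[Slots]) -> float:
    """Mean squared error over all slot entries (assumed loss form)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for pred_slot, target_slot in zip(pred_frame, target_frame):
            for p, t in zip(pred_slot, target_slot):
                total += (p - t) ** 2
                count += 1
    return total / count


def toy_predictor(history: Sequence[Slots], instruction: Sequence[float]) -> Slots:
    """Toy stand-in for the world model: shift every slot by the instruction vector."""
    last = history[-1]
    return [[v + g for v, g in zip(slot, instruction)] for slot in last]


context = [[[0.0, 0.0], [1.0, 1.0]]]  # one observed frame with two 2-D slots
instruction = [0.5, -0.5]             # placeholder for a sentence embedding
future = rollout(toy_predictor, context, instruction, horizon=3, window=1)
```

In training, the predicted slots would be compared against slots extracted from the real future frames; here `slot_mse(future, targets)` plays that role for any `targets` with the same shape.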

Paper Structure

This paper contains 42 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Training and inference overview of our world model. (a) During training, frames are processed through a pre-trained slot encoder to extract slots, and language instructions are processed using a sentence encoder. The slots, along with the instruction representation, are used to condition the world model, which predicts the slots for future states. These predicted slots are then compared with the ground-truth future slots extracted from the frames, and a reconstruction loss is computed to train the world model. (b) More specifically, the world model uses the slots predicted at previous steps to autoregressively predict future slots.
  • Figure 2: Overview of the action decoder training process. (a) The action decoder is trained by inputting the current state slots and future state slots obtained from the trained world model to predict actions. (b) The detailed architecture of the action decoder is as follows: input slots are grouped by timestep and passed through a projection layer with shared weights, followed by a transformer encoder. The outputs are then concatenated in chronological order and fed into a pooling layer to predict the action.
  • Figure 3: Decoded video frames from our method and Seer, and a decoded frame from Susie, conditioned on the given reference frames and the instruction 'Move the cube towards the moon.' Seer-F produces higher-quality generations than Seer-S, but both fail to predict states guided by the instruction. Susie successfully generates the future frame conditioned on the reference frame and the instruction. Seer results are generated using 30 DDIM sampler steps; Susie uses 10 steps.
  • Figure 4: Predicted action trajectories of our method and the baselines in the visuo-linguo-motor control simulation environment. Given the instructions, our method recognizes and moves the correct blocks to complete the episodes, while the baselines fail.
  • Figure 5: Qualitative visualization of slots learned by Slot Attention for Videos (SAVi) and our world model, LSlotFormer. In the top section, SAVi effectively segments scene frames into individual slots. In the bottom section, LSlotFormer uses language guidance to predict future states in slot form, with the decoded slots maintaining the structure learned by SAVi, showing consistency in representation.
  • ...and 4 more figures
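
The action decoder pipeline in Figure 2(b) can be sketched in plain Python as below. This is a hedged illustration under several assumptions, not the paper's architecture: the encoder is applied per timestep, a single weight-free self-attention pass stands in for the transformer encoder (no learned query/key/value weights, residuals, or layer norm), the pooling layer is assumed to be mean pooling followed by a linear head, and all names and dimensions (`decode_action`, `proj_w`, `head_w`, a 2-D action) are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute W x + b, where `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def self_attention(tokens: Matrix) -> Matrix:
    """Scaled dot-product self-attention with no learned weights (encoder stand-in)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)])
    return outputs


def decode_action(slots_by_step: List[Matrix], proj_w: Matrix, proj_b: Vector,
                  head_w: Matrix, head_b: Vector) -> Vector:
    """Map chronologically ordered slot sets to one action vector."""
    encoded_steps = []
    for step_slots in slots_by_step:
        # The same projection weights are shared across all timesteps.
        projected = [linear(slot, proj_w, proj_b) for slot in step_slots]
        encoded_steps.append(self_attention(projected))
    # Concatenate the per-step outputs in chronological order.
    tokens = [tok for step in encoded_steps for tok in step]
    # Assumed pooling: average over all tokens, then a linear head.
    dim = len(tokens[0])
    pooled = [sum(tok[j] for tok in tokens) / len(tokens) for j in range(dim)]
    return linear(pooled, head_w, head_b)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
slot_dim, hidden_dim, action_dim = 4, 3, 2
proj_w, proj_b = random_matrix(hidden_dim, slot_dim, rng), [0.0] * hidden_dim
head_w, head_b = random_matrix(action_dim, hidden_dim, rng), [0.0] * action_dim
# Two timesteps (current slots and predicted future slots), three slots each.
slots_by_step = [random_matrix(3, slot_dim, rng) for _ in range(2)]
action = decode_action(slots_by_step, proj_w, proj_b, head_w, head_b)
```

With mean pooling and no positional information, this sketch is invariant to the order of slots within a timestep, which suits slot representations. It also ignores the chronological order of the concatenated tokens, so a learned pooling layer or a timestep embedding would be needed for the ordering to influence the predicted action.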