LIMT: Language-Informed Multi-Task Visual World Models

Elie Aljalbout, Nikolaos Sotirakis, Patrick van der Smagt, Maximilian Karl, Nutan Chen

TL;DR

This work introduces LIMT, a model-based, language-conditioned approach to multi-task visual world modeling for robotics. It combines a discrete image-proprioception tokenizer, a pre-trained language encoder, and a transformer-based dynamics model to condition both the world model and policy on semantic task directives. Through staged offline-online training and latent imagination, LIMT achieves superior sample efficiency and multi-task performance, demonstrating the value of language embeddings for sharing dynamics and enabling task switching. The results on CALVIN show substantial gains over model-free baselines and highlight practical benefits for real-world, multi-task robotic manipulation.

Abstract

Most recent successes in robot reinforcement learning involve learning a specialized single-task agent. However, robots capable of performing multiple tasks can be much more valuable in real-world applications. Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives. Previous work on this topic is dominated by model-free approaches. The latter can be very sample inefficient even when learning specialized single-task agents. In this work, we focus on model-based multi-task reinforcement learning. We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations. These representations are used by the world model and policy to reason about task similarity in dynamics and behavior. Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.
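The claim that language-driven task representations let the model reason about task similarity can be illustrated with a small sketch. It is not the paper's encoder: `bag_of_words` is a hypothetical stand-in for a pre-trained language model, and the instruction strings are paraphrased from the figure captions. The point is only that text-derived vectors place related tasks close together, whereas one-hot task IDs treat every pair of tasks as equally unrelated.

```python
import math
from collections import Counter


def bag_of_words(instruction):
    """Toy stand-in (assumption) for a pre-trained language encoder: word counts."""
    return Counter(instruction.lower().split())


def one_hot(task_id):
    """Integer-style task representation: a single active index per task."""
    return Counter({task_id: 1})


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a)  # missing keys count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Instructions paraphrased from the captions; not the exact dataset strings.
instructions = [
    "move the slider to the left",
    "move the slider to the right",
    "flip up the switch",
]

text_vectors = [bag_of_words(text) for text in instructions]
id_vectors = [one_hot(i) for i in range(len(instructions))]

# Similarity of the first task to the other two, under each representation.
text_sim_related = cosine(text_vectors[0], text_vectors[1])
text_sim_unrelated = cosine(text_vectors[0], text_vectors[2])
id_sim_related = cosine(id_vectors[0], id_vectors[1])
id_sim_unrelated = cosine(id_vectors[0], id_vectors[2])
```

Under these assumptions the two slider tasks score higher than the slider/switch pair when text-derived vectors are used, while the one-hot vectors give zero similarity for every distinct pair. A real pre-trained encoder would capture semantic rather than merely lexical overlap.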

Paper Structure (25 sections, 13 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: We train a model-based actor-critic agent using a multi-task world model. The actor, critic and the world model are conditioned on tokens from a vision and proprioception tokenizer as well as a language embedding from a pretrained language model. The latter encodes instructions for the different tasks. The resulting agent can perform multiple tasks depending on the instruction input.
  • Figure 2: We plot the success rate over a fixed budget of training epochs. We consider model-free baselines based on multi-task soft actor-critic (MT-SAC). One MT-SAC variant is trained on raw images (MT-SAC:raw) and another on image embeddings from our tokenizer (MT-SAC:token). We include a single-task MBRL baseline (MBRL-ST) and variants of our method that use integer task representations instead of language embeddings. The integer representations are used either only in the actor-critic (LIMT:nlac) or in both the world model and the actor-critic (LIMT:nl). LIMT and its variants achieve better sample efficiency and success rates.
  • Figure 3: (Left) We visualize the language embeddings of our different task instructions in two dimensions using t-SNE (van der Maaten and Hinton, 2008). The circles and triangles represent instructions in the training and test datasets, respectively. Language embeddings of semantically similar task instructions tend to cluster together. (Right) We compare the multi-task success rate of our method to the studied baselines. LIMT outperforms all baselines in the multi-task setting by a large margin.
  • Figure 4: Our agent is able to switch between different tasks during inference. We initialize our model with the instruction to flip up the switch on the right side of the table. The frames of the first row depict the agent trying to reach this goal, before the task description changes (second row). The task is then switched to move the slider to the left. After the agent has performed this task, we change the instruction again. The last row depicts the agent trying to move the slider to the right. By the end of the episode (last frame) the slider has returned to its initial position.
  • Figure 5: Reconstructions of imagined trajectories for the tasks turn_on_lightbulb (top) and close_drawer (bottom). The first frame in each row is initialized from the real environment.
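
The task switching described in Figure 4 can be sketched as a rollout loop in which the instruction passed to the policy is re-read at every step. Everything here is a hypothetical toy, assumed purely for illustration: `toy_policy` stands in for the language-conditioned actor, the environment is a one-dimensional slider position, and the schedule format is invented for this example.

```python
def active_instruction(schedule, step):
    """Return the latest instruction whose start step is <= step.

    `schedule` is a list of (start_step, instruction) pairs sorted by start step.
    """
    current = schedule[0][1]
    for start, instruction in schedule:
        if start <= step:
            current = instruction
    return current


def toy_policy(position, instruction):
    """Hypothetical stand-in for a language-conditioned actor (not the paper's)."""
    if "left" in instruction:
        return -0.25
    if "right" in instruction:
        return 0.25
    return 0.0


def rollout(schedule, num_steps, position=0.0):
    """Run the toy slider environment, switching instructions mid-episode."""
    trajectory = []
    for step in range(num_steps):
        instruction = active_instruction(schedule, step)
        action = toy_policy(position, instruction)
        # Toy dynamics: the slider position is clipped to [-1, 1].
        position = max(-1.0, min(1.0, position + action))
        trajectory.append((step, instruction, position))
    return trajectory


schedule = [
    (0, "move the slider to the left"),
    (4, "move the slider to the right"),
]
trajectory = rollout(schedule, num_steps=8)
```

In this toy run the slider reaches the left limit before the instruction changes and is back at its starting position by the last step, mirroring the final two phases of Figure 4. The design choice it highlights is that the instruction is an input at every step, so no retraining or reset is needed to change tasks.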