Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

Gabriela Sejnova, Michal Vavrecka, Karla Stepanova

TL;DR

This work investigates unsupervised vision-language-action learning for robotic manipulation using multimodal VAEs. By adapting three state-of-the-art multimodal VAE architectures (MVAE, MMVAE, MoPoE) with modality-specific encoders/decoders and a model-independent sigma-VAE training objective, the authors enable end-to-end trajectory generation from image and natural language inputs. Evaluations across 36 synthetic LANRO datasets reveal that the sigma-VAE objective often improves reconstruction and task success (up to 55% in some cases) and that MVAE generally provides the most robust performance, though task complexity and sequence length remain challenging. The study highlights the strengths and limitations of current multimodal VAEs for unsupervised robotic motion learning and suggests future directions such as subtask chaining and multi-object scene understanding, with code and experiments available for reproducibility.

Abstract

In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced outputs. A more lightweight alternative would be the implementation of multimodal Variational Autoencoders (VAEs), which can extract the latent features of the data and integrate them into a joint representation, as has been demonstrated mostly on image-image or image-text data for the state-of-the-art models. Here we explore whether and how multimodal VAEs can be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the obtained results, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by the individual tasks, such as object or robot position variability, the number of distractors, or the task length. Our work thus also sheds light on the potential benefits and limitations of using current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.

Paper Structure

This paper contains 24 sections, 5 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The proposed architecture scheme for the training procedure (top) and inference (bottom). During training, the language command, image and motion trajectory are fed into individual model encoders. The multimodal mixing method (MVAE, MMVAE or MoPoE) learns the joint posterior distribution. The reconstruction loss for each modality is calculated using the $\sigma$-VAE loss and used in the ELBO or DReG (for MMVAE) objective. During inference, the model predicts the whole motion trajectory based on the provided image and command.
  • Figure 2: Examples from the 36 datasets generated for our experiments. In each dataset, we use the top view of the scene with the robot as the visual input and the instruction as the text input (where necessary for task disambiguation). We also provide the task motion trajectories as another modality. A: scene complexity (rows) and the number of actions (tasks) that the model represents concurrently (columns): either a single action (e.g., move right) or several actions at once (e.g., for 2 actions, the model learns move right and lift together). B: task length (rows) and position variability (columns): Var. 1 varies object positions along the x axis, Var. 2 varies them along the x and y axes, and Var. 3 additionally varies the robot position. For more details, see Section \ref{sec:exp}.
  • Figure 3: The objects used in our experimental setup: apple, soap and lemon.
  • Figure 4: Example of the reach+lift+insert+close sequence (ordered by numbers), which is the longest task in our datasets. The length of the sequence is up to 68 timesteps. For more examples, see the attached video.
  • Figure 5: Accuracy for the reach task based on the threshold of the final distance between the gripper and the target reach position, which is 6 cm from the object centroid (the optimal distance for grasping). We show the results for MVAE, MMVAE and MoPoE. "1 random" denotes the dataset with 1 randomly placed object; the "1 random + 1 distractor" dataset additionally includes a randomly placed distractor in the scene.
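
The threshold-based accuracy described in the Figure 5 caption can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `reach_accuracy` and `accuracy_curve`, the use of metres as the unit, and the example coordinates are all assumptions made for illustration. The target reach positions are taken as given inputs, since the caption does not say how the point 6 cm from the object centroid is chosen.

```python
import math


def reach_accuracy(final_gripper_positions, target_positions, threshold_m):
    """Fraction of trials whose final gripper-to-target distance is <= threshold_m.

    Both position arguments are sequences of (x, y, z) tuples in metres
    (an assumed convention, not taken from the paper).
    """
    if len(final_gripper_positions) != len(target_positions):
        raise ValueError("need one target position per trial")
    if not final_gripper_positions:
        raise ValueError("need at least one trial")
    hits = sum(
        1
        for gripper, target in zip(final_gripper_positions, target_positions)
        if math.dist(gripper, target) <= threshold_m
    )
    return hits / len(final_gripper_positions)


def accuracy_curve(final_gripper_positions, target_positions, thresholds_m):
    """Evaluate reach_accuracy at several thresholds, returning (threshold, accuracy) pairs."""
    return [
        (th, reach_accuracy(final_gripper_positions, target_positions, th))
        for th in thresholds_m
    ]


# Hypothetical data: three trials with final distances of about 1 cm, 4 cm and 20 cm.
finals = [(0.10, 0.00, 0.05), (0.20, 0.10, 0.05), (0.00, 0.00, 0.30)]
targets = [(0.10, 0.00, 0.06), (0.20, 0.14, 0.05), (0.00, 0.00, 0.10)]
curve = accuracy_curve(finals, targets, [0.02, 0.05, 0.25])
```

With these made-up trials, the accuracy rises from one third at a 2 cm threshold to two thirds at 5 cm and to 1.0 at 25 cm. Sweeping the threshold in this way shows how sensitive the reported accuracy is to the choice of cut-off distance.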