MuTT: A Multimodal Trajectory Transformer for Robot Skills

Claudius Kienle, Benjamin Alt, Onur Celik, Philipp Becker, Darko Katic, Rainer Jäkel, Gerhard Neumann

TL;DR

MuTT addresses the challenge of configuring robot skill parameters for dynamic environments with an environment-aware encoder-decoder transformer that fuses vision, trajectory, and skill parameters. A novel trajectory projection preserves high temporal resolution and key force information, enabling accurate environment-conditioned trajectory predictions. Used as the predictor inside a model-based optimizer, MuTT allows skill parameters to be optimized without real-world executions during optimization. It is demonstrated on three tasks and two skill representations: industrial cable grasping, force-controlled plug insertion, and ProDMP-based skills in ManiSkill2, with reported improvements in prediction accuracy and task success. The approach offers a general foundation for rapidly adapting robot skills to the current environment while reducing the number of real-world trials needed in industrial and research settings.

Abstract

High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.
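The idea of optimizing skill parameters against a learned predictor instead of the real robot can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `toy_predictor` is a hand-written stand-in (not MuTT), the parameters `speed` and `depth`, the environment feature `env_offset`, the cost weights, and the 6 N limit are hypothetical, and a plain random-search hill climber is swapped in for the optimizer used in the paper.

```python
import numpy as np

FORCE_LIMIT = 6.0    # N, illustrative user-defined force limit
TARGET_DEPTH = 0.15  # illustrative task goal for the skill

def toy_predictor(params, env_offset):
    """Hypothetical stand-in for a learned, environment-aware predictor.

    params: [speed, depth] skill parameters (made-up units).
    env_offset: scalar environment feature (e.g. where contact begins).
    Returns a predicted Z-force trajectory of shape (50,).
    """
    speed, depth = params
    t = np.linspace(0.0, 1.0, 50)
    penetration = np.maximum(depth * t - env_offset, 0.0)
    return 400.0 * speed * penetration

def cost(params, env_offset):
    """Task cost computed purely from the predicted trajectory."""
    force = toy_predictor(params, env_offset)
    violation = np.maximum(force - FORCE_LIMIT, 0.0)
    cycle_time = 1.0 / params[0]                    # slower motion, longer cycle
    depth_error = (params[1] - TARGET_DEPTH) ** 2   # stay close to the task goal
    return float(np.sum(violation ** 2) + 0.1 * cycle_time + 1e4 * depth_error)

def optimize(params, env_offset, iters=500, sigma=0.02, seed=0):
    """Random-search hill climbing: keep a candidate only if the cost drops."""
    rng = np.random.default_rng(seed)
    best = np.asarray(params, dtype=float)
    best_cost = cost(best, env_offset)
    for _ in range(iters):
        candidate = best + rng.normal(0.0, sigma, size=best.shape)
        candidate = np.clip(candidate, [0.05, 0.0], [1.0, 0.2])
        candidate_cost = cost(candidate, env_offset)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost

initial = np.array([0.8, 0.15])
env_offset = 0.05
initial_cost = cost(initial, env_offset)
optimized, optimized_cost = optimize(initial, env_offset)
print("initial cost:", initial_cost, "optimized cost:", optimized_cost)
```

Every cost evaluation queries only the predictor, so no robot execution is needed inside the loop; the environment enters through `env_offset`, which a vision-conditioned model would infer from the image rather than receive as a scalar.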

Paper Structure (17 sections, 1 equation, 7 figures, 2 tables)

This paper contains 17 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: MuTT is used in the SPI parameter optimizer [alt_robot_2021] to refine the initial search pattern (red dots, top). The optimization yields an improved search pattern (red dots, bottom), reducing the number of probes required to locate the socket from six to two. This significantly decreases the cycle time of the task. As an environment-aware model of robot skills, MuTT enables the optimization of skill parameters for the current environment, alleviating the need for real-world executions during optimization.
  • Figure 2: MuTT architecture: modality-specific embedding of the simulated trajectory (blue), skill parameters (green), and environment image (red) into tokens, which are concatenated into one token sequence. All tokens receive modality-specific positional and token-type encodings and are passed through an encoder transformer. The decoder predicts the real-world trajectory (purple) autoregressively.
  • Figure 3: Evaluation scenarios: real-world grasping of deformable cables (grasp-skill experiment, left), real-world force-controlled plug insertion under uncertainty (plug-skill experiment, middle), and simulated grasping in the ManiSkill2 benchmark (ManiSkill experiment, right).
  • Figure 4: Comparison of training MuTT on the dataset of the grasp-skill experiment with different initial weights from related applications [ao_speecht5_2022, dosovitskiy_image_2021, kim_vilt_2021, hsu_hubert_2021]. Initialization with ViT [dosovitskiy_image_2021] resulted in the best evaluation performance.
  • Figure 5: Plug-skill experiment: end-effector pose (along the Z axis, left) and force (along the Z axis, middle) predicted by MuTT for a probe skill. MuTT predicts forces accurately, enabling the optimization of the robot skill with SPI [alt_robot_2021] to adhere to the user-defined force limit of 6 N, while the unoptimized skill greatly exceeds that limit (right).
  • ...and 2 more figures
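
The token fusion described in the Figure 2 caption can be sketched roughly as follows. This is a shape-level illustration only, not the authors' implementation: the random matrices stand in for learned projections and embeddings, and the chunk length, patch size, model width, and channel counts are all assumptions. In particular, the chunk-wise trajectory embedding is a simple placeholder and should not be read as the paper's trajectory projection.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 32  # assumed token width

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a learned projection."""
    return rng.normal(0.0, 0.02, size=(in_dim, out_dim))

def embed_trajectory(traj, chunk=10):
    """Placeholder: one token per chunk of `chunk` time steps of a (T, C) trajectory."""
    T, C = traj.shape
    chunks = traj[: T - T % chunk].reshape(-1, chunk * C)
    return chunks @ linear(chunk * C, D_MODEL)

def embed_params(params):
    """One token per scalar skill parameter."""
    return params[:, None] @ linear(1, D_MODEL)

def embed_image(img, patch=8):
    """ViT-style: flatten non-overlapping patch x patch tiles and project each."""
    H, W, C = img.shape
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tiles @ linear(patch * patch * C, D_MODEL)

def fuse(traj, params, img):
    """Add per-modality type and positional embeddings, then concatenate."""
    groups = [embed_trajectory(traj), embed_params(params), embed_image(img)]
    tokens = []
    for group in groups:
        type_emb = rng.normal(0.0, 0.02, size=(1, D_MODEL))  # token-type encoding
        pos_emb = rng.normal(0.0, 0.02, size=group.shape)    # modality-specific positions
        tokens.append(group + type_emb + pos_emb)
    return np.concatenate(tokens, axis=0)

traj = rng.normal(size=(100, 7))   # assumed: 100 steps, 7 pose/force channels
params = rng.normal(size=(5,))     # assumed: 5 skill parameters
img = rng.random((32, 32, 3))      # assumed: small RGB environment image
sequence = fuse(traj, params, img)
print(sequence.shape)
```

With these assumed sizes, the trajectory contributes 10 tokens, the parameters 5, and the image 16, giving a single sequence that an encoder transformer could consume.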