MuTT: A Multimodal Trajectory Transformer for Robot Skills
Claudius Kienle, Benjamin Alt, Onur Celik, Philipp Becker, Darko Katic, Rainer Jäkel, Gerhard Neumann
TL;DR
MuTT addresses the challenge of configuring robot skill parameters for dynamic environments by introducing an environment-aware encoder-decoder transformer that fuses vision, trajectory, and skill parameters. A novel trajectory projection preserves high temporal resolution and key force information, enabling accurate environment-conditioned trajectory predictions. The model serves as the predictor within model-based optimization frameworks, so skill parameters can be optimized without real-world executions during optimization. It is demonstrated on three tasks and two skill representations, including industrial cable grasping, force-controlled plug insertion, and ProDMP-based skills in ManiSkill2, showing improvements in prediction accuracy and task success. This work provides a generalizable foundation for adapting robot skills quickly to the current environment, with the potential to reduce real-world trials in industrial and research settings.
Abstract
High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.
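The optimization idea described in the abstract, tuning skill parameters against a predictor's output instead of running the robot, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: `toy_predictor` is a hypothetical analytic stand-in for a learned model such as MuTT, the two skill parameters (`depth`, `speed`), the scalar environment feature `contact_height`, the cost function, and the random-search optimizer are all invented here.

```python
import math
import random


def toy_predictor(params, contact_height, steps=50):
    """Hypothetical stand-in for a learned, environment-aware predictor.

    Maps skill parameters (depth, speed) and one environment feature
    (contact_height) to a predicted 1-D trajectory. A real model would be
    a trained network conditioned on images; this is a toy formula.
    """
    depth, speed = params
    trajectory = []
    for i in range(steps):
        t = i / (steps - 1)
        position = depth * (1.0 - math.exp(-speed * t))
        # The environment limits the motion: the tool stops at contact.
        trajectory.append(min(position, contact_height))
    return trajectory


def task_cost(trajectory, contact_height):
    """Assumed task objective: reach the contact height with gentle motion."""
    final_error = (trajectory[-1] - contact_height) ** 2
    peak_step = max(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    return final_error + 0.1 * peak_step


def optimize_params(contact_height, init_params, iters=300, sigma=0.1, seed=0):
    """Random-search optimization that queries only the predictor."""
    rng = random.Random(seed)
    bounds = [(0.0, 2.0), (0.1, 10.0)]  # assumed limits for (depth, speed)
    best = list(init_params)
    best_cost = task_cost(toy_predictor(best, contact_height), contact_height)
    for _ in range(iters):
        # Perturb the current best parameters and clamp them to the bounds.
        candidate = [
            min(max(p + rng.gauss(0.0, sigma), lo), hi)
            for p, (lo, hi) in zip(best, bounds)
        ]
        cost = task_cost(toy_predictor(candidate, contact_height), contact_height)
        if cost < best_cost:  # keep a candidate only if the prediction improves
            best, best_cost = candidate, cost
    return best, best_cost


if __name__ == "__main__":
    params, cost = optimize_params(contact_height=1.0, init_params=[0.2, 1.0])
    print("optimized (depth, speed):", params, "predicted cost:", cost)
```

No robot execution occurs inside the loop: every candidate is scored on a predicted trajectory. Random search is used here only because it keeps the sketch dependency-free; a differentiable predictor would also permit gradient-based optimization of the parameters.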
