Bringing the RT-1-X Foundation Model to a SCARA robot

Jonathan Salzer; Arnoud Visser

TL;DR

This study investigates whether the RT-1-X robotic foundation model generalizes to a type of robot unseen during its training, a SCARA robot from UMI-RTX, and finds that RT-1-X does not generalize zero-shot to this robot type.

Abstract

Traditional robotic systems require specific training data for each task, environment, and robot form. While recent advancements in machine learning have enabled models to generalize across new tasks and environments, the challenge of adapting these models to entirely new settings remains largely unexplored. This study addresses that gap by investigating the generalization capabilities of the RT-1-X robotic foundation model to a type of robot unseen during its training: a SCARA robot from UMI-RTX. Initial experiments reveal that RT-1-X does not generalize zero-shot to the unseen type of robot. However, fine-tuning RT-1-X on demonstrations allows the robot to learn a pickup task that was part of the foundation model but had been learned for another type of robot. When the robot is presented with an object that is included in the foundation model but not in the fine-tuning dataset, its behavior shows that only the skill, and not the object-specific knowledge, has been transferred.

Paper Structure (30 sections, 5 figures, 2 tables)

Introduction
Theoretical Background
RT-1
Universal Sentence Encoder
FiLM
EfficientNet
Input Tokenization
TokenLearner
Open X-Embodiment and RT-1-X
Pre-trained models
Dataset Composition
Pre-trained model RT-1-X
Action and Observation Space Alignment
UMI RTX Robotic Embodiment
History and Use Cases
...and 15 more sections

Figures (5)

Figure 1: The UMI-RTX robot with an object in its working space.
Figure 2: A simplified version of the RT-1 architecture, showing the input of an image history and language instruction, application of the various modules, and the robot action output format.
Figure 3: Using different batch sizes in training had significant effects on the noise in the loss curves. Compared here are training loss curves for batch size 5 (red) and batch size 2 (blue), with all other parameters unchanged. The biggest usable batch size in this research is five, due to hardware limitations.
Figure 4: The choice of the learning rate proved to be essential for the quality of the training process. The chart shows the comparison of different learning rates when fine-tuning RT-1-X on the UMI dataset. A learning rate of 5e-6 was ultimately chosen.
Figure 5: Inference is run with images from the UMI fine-tuning dataset to verify that fine-tuning was effective. Output of RT-1-X run with images from the UMI dataset is shown in red, compared to actions recorded during demonstration (ground truth) in blue.
