Multi-Task Interactive Robot Fleet Learning with Visual World Models

Huihan Liu, Yu Zhang, Vaarij Betala, Evan Zhang, James Liu, Crystal Ding, Yuke Zhu

TL;DR

This paper introduces Sirius-Fleet, a multi-task interactive robot fleet learning framework that addresses the generalization and robustness challenges robots face under real-world variability and uncertainty, and demonstrates its effectiveness in improving multi-task policy performance and monitoring accuracy.

Abstract

Recent advancements in large-scale multi-task robot learning offer the potential for deploying robot fleets in household and industrial settings, enabling them to perform diverse tasks across various environments. However, AI-enabled robots often face challenges with generalization and robustness when exposed to real-world variability and uncertainty. We introduce Sirius-Fleet, a multi-task interactive robot fleet learning framework to address these challenges. Sirius-Fleet monitors robot performance during deployment and involves humans to correct the robot's actions when necessary. We employ a visual world model to predict the outcomes of future actions and build anomaly predictors to predict whether they will likely result in anomalies. As the robot autonomy improves, the anomaly predictors automatically adapt their prediction criteria, leading to fewer requests for human intervention and gradually reducing human workload over time. Evaluations on large-scale benchmarks demonstrate Sirius-Fleet's effectiveness in improving multi-task policy performance and monitoring accuracy. We demonstrate Sirius-Fleet's performance in both RoboCasa in simulation and Mutex in the real world, two diverse, large-scale multi-task benchmarks. More information is available on the project website: https://ut-austin-rpl.github.io/sirius-fleet

Paper Structure

This paper contains 22 sections, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Overview of Sirius-Fleet. Our framework of multi-task interactive robot fleet learning consists of two stages: 1) Visual World Model Training & Inference, where we pre-train a visual world model on diverse datasets to predict future latent states from past visual observations, and 2) Multi-Task Interactive Fleet Learning, where anomaly predictors, built upon the pre-trained visual world model, enable real-time monitoring of the multi-task robot fleet during deployment, and solicit human feedback when necessary. The policy and anomaly predictors are continuously fine-tuned with deployment data, improving task performance over time.
  • Figure 2: Model Architecture. The visual world model comprises a UNet-based encoder and decoder combined with a cVAE- and Transformer-based prediction model. This architecture allows the world model to predict future embeddings from the current state. The learned representations are then used for anomaly predictions, including failure and OOD prediction.
  • Figure 3: Adaptive Decision Boundaries. Top: OOD Prediction Boundary. The threshold $\theta_g$, determined by the human intervention ratio, sets the distance threshold $\alpha_g$. A sample is identified as OOD if its embedding's distance $d$ from the cluster centroid exceeds $\alpha_g$. Bottom: Fitting function for optimal $\theta_g$ based on the human intervention ratio $p_H$. The x-axis shows $1 - p_H$, representing the autonomous rollout ratio.
  • Figure 4: Policy Architecture. The multi-task policy is a Transformer that processes images, proprioceptive data, and task language embeddings. It uses a Gaussian Mixture Model (GMM) to output robot actions.
  • Figure 5: RoboCasa Simulation Tasks and Mutex Real-World Tasks. We evaluate policy learning and runtime monitoring using 12 tasks from the RoboCasa benchmark in simulation and 10 tasks from the Mutex benchmark in real-world environments.
  • ...and 15 more figures
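
The OOD decision rule in the Figure 3 caption (a sample is OOD if its embedding's distance $d$ from the cluster centroid exceeds $\alpha_g$) can be illustrated with a minimal sketch. The caption does not say how $\theta_g$ is converted into $\alpha_g$, so the sketch assumes $\theta_g$ is a quantile in (0, 1] taken over the distances of in-distribution training embeddings. The function names, the Euclidean metric, and the nearest-centroid choice are likewise illustrative assumptions, not the authors' implementation.

```python
import math

def nearest_centroid_distance(embedding, centroids):
    """Euclidean distance from one embedding to its closest cluster centroid."""
    return min(math.dist(embedding, c) for c in centroids)

def distance_threshold(train_embeddings, centroids, theta_g):
    """Hypothetical mapping from theta_g to alpha_g.

    Assumption: theta_g is a quantile in (0, 1], and alpha_g is the
    nearest-rank quantile of the training embeddings' centroid distances.
    """
    dists = sorted(nearest_centroid_distance(e, centroids) for e in train_embeddings)
    rank = math.ceil(theta_g * len(dists)) - 1
    rank = max(0, min(len(dists) - 1, rank))  # clamp to a valid index
    return dists[rank]

def is_ood(embedding, centroids, alpha_g):
    """Flag a sample as OOD when its distance d exceeds alpha_g."""
    return nearest_centroid_distance(embedding, centroids) > alpha_g
```

Under this assumption, lowering `theta_g` tightens the boundary so that more samples are flagged and more human interventions are requested, while raising it loosens the boundary. That matches the caption's description of a threshold tied to the human intervention ratio $p_H$, but the fitted function relating the two is not reproduced here.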