FogROS2-FT: Fault Tolerant Cloud Robotics

Kaiyuan Chen; Kush Hari; Trinity Chung; Michael Wang; Nan Tian; Christian Juette; Jeffrey Ichnowski; Liu Ren; John Kubiatowicz; Ion Stoica; Ken Goldberg

TL;DR

FogROS2-FT tackles downtime and QoS variability in cloud robotics by deploying replicated, stateless ROS2 services across multiple clouds, routing each request to the replicas, and returning the first response to the robot. The approach uses a replication-aware proxy and SkyPilot multi-cloud provisioning so that low-cost spot VMs can be used while maintaining per-request fault tolerance. In three simulated workloads and one physical robot experiment, it reduces 99th-percentile (P99) long-tail latency by up to 5.53x and cost by up to 2.2x, and it remains reliable under network slowdowns and spot preemption. The work contributes an open-source fault-tolerant extension to FogROS2, a scalable architecture, and practical guidance for deploying fault-tolerant cloud robotics.

Abstract

Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and the cloud is subject to variations in network Quality-of-Service (QoS). We present FogROS2-FT (Fault Tolerant) to mitigate these issues by introducing a multi-cloud extension that automatically replicates independent stateless robotic services, routes requests to these replicas, and directs the first response back. With replication, robots can still benefit from cloud computation even when a cloud service provider is down or QoS is low. Additionally, many cloud computing providers offer low-cost spot computing instances that may shut down unpredictably. Normally, these low-cost instances would be inappropriate for cloud robotics, but the fault-tolerant nature of FogROS2-FT allows them to be used reliably. We demonstrate the fault tolerance capabilities of FogROS2-FT in 3 cloud-robotics scenarios in simulation (visual object detection, semantic segmentation, motion planning) and 1 physical robot experiment (scan-pick-and-place). Running on the same hardware specification, FogROS2-FT achieves motion planning with up to a 2.2x cost reduction and up to a 5.53x reduction in 99th-percentile (P99) long-tail latency. FogROS2-FT reduces the P99 long-tail latency of object detection and semantic segmentation by 2.0x and 2.1x, respectively, under network slowdown and resource contention.
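The replicate-and-take-first-response idea described above can be illustrated with a minimal sketch. This is not the FogROS2-FT implementation or its API: the replica names, delays, and the helpers call_replica and first_response are hypothetical stand-ins, and plain asyncio coroutines take the place of ROS2 service calls.

```python
import asyncio


async def call_replica(name, delay_s, fail=False):
    """Hypothetical stand-in for one stateless cloud service replica."""
    await asyncio.sleep(delay_s)  # simulated network + compute latency
    if fail:
        # Simulates a preempted spot instance or an unreachable provider.
        raise ConnectionError(f"{name} unavailable")
    return f"response from {name}"


async def first_response(calls):
    """Issue the same request to every replica; return the first success."""
    tasks = [asyncio.create_task(c) for c in calls]
    errors = []
    try:
        for fut in asyncio.as_completed(tasks):
            try:
                return await fut  # first replica to answer wins
            except Exception as exc:
                errors.append(exc)  # a failed replica is simply skipped
        raise RuntimeError(f"all replicas failed: {errors}")
    finally:
        # Slower replicas are no longer needed once one has answered.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    reply = asyncio.run(first_response([
        call_replica("cloud-a", 0.2),
        call_replica("cloud-b", 0.01, fail=True),
        call_replica("cloud-c", 0.05),
    ]))
    print(reply)
```

Because the request only fails when every replica fails, a single slow or preempted replica affects neither correctness nor tail latency in this sketch, which is the property that makes unreliable spot instances usable.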
