FogROS2-FT: Fault Tolerant Cloud Robotics

Kaiyuan Chen, Kush Hari, Trinity Chung, Michael Wang, Nan Tian, Christian Juette, Jeffrey Ichnowski, Liu Ren, John Kubiatowicz, Ion Stoica, Ken Goldberg

TL;DR

FogROS2-FT tackles downtime and QoS variability in cloud robotics by deploying replicated, stateless ROS2 services across multiple clouds and routing the first response back to the robot. The approach uses a replication-aware proxy and SkyPilot multi-cloud provisioning to enable cost-effective use of spot VMs while maintaining per-request fault tolerance. Experiments in simulation and on a physical robot show up to a 5.53x reduction in long-tail (P99) latency and up to a 2.2x cost reduction for motion planning, with reliability maintained under network slowdowns and spot preemption. The work contributes an open-source fault-tolerant extension to FogROS2, a scalable architecture, and practical guidance for deploying fault-tolerant cloud robotics.

Abstract

Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and cloud can be prone to variations in network Quality-of-Service (QoS). We present FogROS2-FT (Fault Tolerant) to mitigate these issues by introducing a multi-cloud extension that automatically replicates independent stateless robotic services, routes requests to these replicas, and directs the first response back. With replication, robots can still benefit from cloud computations even when a cloud service provider is down or QoS is low. Additionally, many cloud computing providers offer low-cost spot computing instances that may shut down unpredictably. Normally, these low-cost instances would be inappropriate for cloud robotics, but the fault-tolerant nature of FogROS2-FT allows them to be used reliably. We demonstrate FogROS2-FT's fault-tolerance capabilities in 3 cloud-robotics scenarios in simulation (visual object detection, semantic segmentation, motion planning) and 1 physical robot experiment (scan-pick-and-place). Running on the same hardware specification, FogROS2-FT achieves motion planning with up to a 2.2x cost reduction and up to a 5.53x reduction in 99th-percentile (P99) long-tail latency. FogROS2-FT reduces the P99 long-tail latency of object detection and semantic segmentation by 2.0x and 2.1x, respectively, under network slowdown and resource contention.
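
A rough way to see why returning the first response shortens the latency tail is sketched below. It rests on an idealization that the abstract does not state: the latencies of the n replicas are independent and identically distributed. The symbols T_i, F, and n are introduced here for illustration only.

```latex
% Assumption (not from the paper): replica latencies T_1, ..., T_n are i.i.d. with CDF F(t).
\[
  T_{\text{first}} = \min_{1 \le i \le n} T_i,
  \qquad
  \Pr\bigl[T_{\text{first}} > t\bigr]
    = \prod_{i=1}^{n} \Pr\bigl[T_i > t\bigr]
    = \bigl(1 - F(t)\bigr)^{n}.
\]
% Example: with n = 2, a latency that 1% of single-replica requests exceed
% is exceeded by only (0.01)^2 = 0.01% of replicated requests.
```

In practice replicas share the robot's uplink and may sit in correlated failure domains, so the real improvement can be smaller than this idealized bound suggests.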

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: FogROS2-FT Overview. (Top) Cloud robotics applications, such as grasp planning, when deployed on a single cloud server become a single point of failure. (Bottom) Instead, FogROS2-FT provides a cost-efficient and fault-tolerant solution that deploys unmodified ROS2 applications to multiple low-cost cloud servers, making cloud-robotics applications resilient to individual server termination and network slowdowns.
  • Figure 2: System Overview of FogROS2-FT. FogROS2-FT transparently proxies ROS2 communication. It sends requests to multiple replicated spot VMs and routes the first response back to the robot. FogROS2-FT manages spot VMs to resiliently recover from unpredictable terminations.
  • Figure 3: Flow diagram of how FogROS2-FT handles new requests. The FogROS2-FT replication-aware proxy handles a ROS2 grasp-planning request with fault-tolerance guarantees in multiple steps. (1) The ROS2 application sends a request on the local ROS2 network. (2) The proxy running on the robot receives the request and extracts the content and a unique identifier from the ROS2 middleware (rmw) layer buffer. (3) The proxy registers the unique ID from rmw with a handle, which includes a callback function for when the response to the request arrives and a callback function for timeout. (4) The proxy securely sends the request to proxies running on replicated cloud machines. There can be multiple proxy hops between the robot and the server that hosts the desired ROS2 service. The request message carries the unique identifier, and the proxy adds an entry to the registry table. (5) The proxy running on the cloud converts the message to a standard ROS2 request message, invokes the ROS2 service on the cloud, and gets the response. (6) The proxy on the cloud sends the response back to the proxy on the robot. (7.A) The robot checks whether it has already handled a response with that unique identifier. (7.B, on duplicated responses) The proxy drops the response if it was already handled. (7.C, on timeout) The proxy calls the timeout handler (for example, returning an empty response) and cleans up the registry table. (8) The robot sends the response to the application on the robot through the standard ROS2 protocol.
  • Figure 4: Flexible Topology for Different Bandwidth of Robots. (a) Since FogROS2-FT sends replicated requests to multiple cloud machines, it demands more network bandwidth than conventional cloud-robotics deployments. (b) FogROS2-FT allows a flexible topology so that a low-bandwidth robot can leverage cloud machines with higher bandwidth to forward requests to replicated services. One can use either a dedicated gateway machine (left) or existing compute servers (right).
  • Figure 5: FogROS2-FT Latency on Motion Planning Template. We tested FogROS2-FT on 3 different motion planning environments (columns (a), (b), and (c)). Due to the stochastic nature of the algorithm, we aggregated results for each scenario and server configuration over 100 trials with a 100 s timeout. The top row shows the frequency histogram for each scenario when run with a single server (in blue). The middle row shows the frequency histogram for each scenario when run on 2 servers with FogROS2-FT (in orange). In all scenarios, the leftward shift of the FogROS2-FT histograms (in orange) relative to their corresponding single-server histograms (in blue) indicates improved latency when running on replicated servers. The bottom row compares the cumulative distribution functions (CDFs) for single-server (in blue) and two-server (in orange) runs. The two-server CDF lies to the left of the single-server CDF, indicating overall improved performance with lower average latency in all scenarios.
  • ...and 4 more figures
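
The request flow in the Figure 3 caption (register a unique ID with response and timeout callbacks, fan the request out to replicas, deliver the first response, drop duplicates, and clean up on timeout) might be sketched as follows. This is a minimal illustration using plain Python threads, not the FogROS2-FT implementation: `ResponseRegistry`, `call_replicated`, and the three demo replicas are hypothetical names, replicas are modeled as local callables rather than ROS2 services behind network proxies, and the IDs come from `uuid` rather than the rmw layer.

```python
import queue
import threading
import time
import uuid
from typing import Any, Callable, Dict, Optional, Sequence, Tuple


class ResponseRegistry:
    """Hypothetical registry table: only the first response per request ID is delivered."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[str, Tuple[Callable[[Any], None], threading.Timer]] = {}

    def register(self, on_response: Callable[[Any], None],
                 on_timeout: Callable[[], None], timeout_s: float) -> str:
        """Step 3: store a handle (response callback plus timeout timer) under a unique ID."""
        request_id = uuid.uuid4().hex  # stand-in for the ID taken from rmw
        timer = threading.Timer(timeout_s, self._expire, args=(request_id, on_timeout))
        timer.daemon = True
        with self._lock:
            self._pending[request_id] = (on_response, timer)
        timer.start()
        return request_id

    def deliver(self, request_id: str, payload: Any) -> bool:
        """Steps 7.A/7.B: hand over the first response; report False for duplicates."""
        with self._lock:
            entry = self._pending.pop(request_id, None)
        if entry is None:
            return False  # already handled or timed out, so drop it
        on_response, timer = entry
        timer.cancel()
        on_response(payload)
        return True

    def _expire(self, request_id: str, on_timeout: Callable[[], None]) -> None:
        """Step 7.C: on timeout, remove the entry and call the timeout handler."""
        with self._lock:
            entry = self._pending.pop(request_id, None)
        if entry is not None:
            on_timeout()


def call_replicated(registry: ResponseRegistry,
                    replicas: Sequence[Callable[[Any], Any]],
                    request: Any, timeout_s: float = 1.0) -> Optional[Any]:
    """Send one request to every replica and return the first response (None on timeout)."""
    result: "queue.Queue[Optional[Any]]" = queue.Queue(maxsize=1)
    request_id = registry.register(result.put, lambda: result.put(None), timeout_s)

    def worker(replica: Callable[[Any], Any]) -> None:
        try:
            response = replica(request)
        except Exception:
            return  # a failed or preempted replica simply never answers
        registry.deliver(request_id, response)

    for replica in replicas:
        threading.Thread(target=worker, args=(replica,), daemon=True).start()
    return result.get()


# Demo replicas (assumptions for illustration only).
def fast_replica(request: str) -> str:
    time.sleep(0.01)
    return "fast:" + request


def slow_replica(request: str) -> str:
    time.sleep(0.3)
    return "slow:" + request


def preempted_replica(request: str) -> str:
    raise RuntimeError("spot instance terminated")


if __name__ == "__main__":
    registry = ResponseRegistry()
    print(call_replicated(registry, [slow_replica, fast_replica, preempted_replica], "plan"))
```

Popping the registry entry under a lock means exactly one of the response or timeout callbacks fires per request, so late duplicates and late timeouts are both ignored without extra bookkeeping.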