FogROS2-PLR: Probabilistic Latency-Reliability For Cloud Robotics

Kaiyuan Chen, Nan Tian, Christian Juette, Tianshuang Qiu, Liu Ren, John Kubiatowicz, Ken Goldberg

TL;DR

An impossibility triangle theorem is formulated for Latency reliability, Singleton server, and Commodity hardware, and FogROS2-PLR optimizes the selection of interfaces to servers to minimize the probability of missing a deadline.

Abstract

Cloud robotics enables robots to offload computationally intensive tasks to cloud servers for performance, cost, and ease of management. However, network and cloud computing infrastructure are not designed to provide reliable timing guarantees, because Quality-of-Service (QoS) fluctuates. In this work, we formulate an impossibility triangle theorem for Latency reliability, Singleton server, and Commodity hardware (LSC). The LSC theorem suggests that providing replicated servers with uncorrelated failures can exponentially reduce the probability of missing a deadline. We present FogROS2-Probabilistic Latency Reliability (PLR), which uses multiple independent network interfaces to send requests to replicated cloud servers and uses the first response that comes back. We design routing mechanisms to discover, connect, and route through non-default network interfaces on robots. FogROS2-PLR optimizes the selection of interfaces to servers to minimize the probability of missing a deadline. We conduct a cloud-connected driving experiment with two 5G service providers, demonstrating that FogROS2-PLR provides smooth service quality even when one of the service providers experiences low coverage or base station handover. We use 99th-percentile (P99) latency to evaluate anomalous long-tail latency behavior. In one experiment, FogROS2-PLR improves P99 latency by up to 3.7x compared to using one service provider. We also deploy FogROS2-PLR on a physical Stretch 3 robot performing an indoor human-tracking task. Even in an environment fully covered by Wi-Fi and 5G, FogROS2-PLR improves the responsiveness of the robot, reducing mean latency by 36% and P99 latency by 33%.
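The two quantities the abstract relies on can be illustrated with a minimal sketch. It assumes that each replicated path (interface plus server) misses the deadline independently with a known probability, so the request misses only if every path misses; it also computes P99 with the nearest-rank method. The function names and the per-path probabilities are hypothetical and are not taken from the paper.

```python
import math


def miss_probability(per_path_miss):
    """Probability that ALL replicated paths miss the deadline.

    Assumption (not from the paper's text): paths fail independently,
    so the joint miss probability is the product of per-path values.
    """
    return math.prod(per_path_miss)


def p99(latencies_ms):
    """Nearest-rank 99th-percentile latency of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical per-path miss probabilities of 10% each.
    for replicas in (1, 2, 3):
        print(replicas, miss_probability([0.1] * replicas))
    print(p99(list(range(1, 101))))
```

Under the independence assumption, each added replica multiplies the miss probability by that path's miss rate, which is the sense in which the reduction is exponential in the number of replicas.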

Paper Structure (10 sections, 7 equations, 7 figures)

Figures (7)

  • Figure 1: FogROS2-PLR Use Case. A mobile robot in a warehouse connects to the cloud for vision, planning, and coordination. A smooth connection is required for safety and responsiveness. (Left) Conventional cloud robotics is subject to a single point of failure. At the top and middle, a network or server failure leads to a complete breakdown of the system. At the bottom, transitioning to an alternative network or server during a slowdown leads to QoS degradation. (Right) Instead, FogROS2-PLR provides a fault-tolerant solution that deploys unmodified ROS2 applications to multiple low-cost cloud servers, making cloud-robotics applications resilient to individual server termination and network slowdowns.
  • Figure 2: Impossibility Triangle of LSC Theorem. Among probabilistic Latency Reliability, Singleton deployment, and Commodity infrastructure, a cloud robotics system can have at most two of these three properties. We characterize FogROS2-PLR and its related work on the edges of the impossibility triangle.
  • Figure 3: System Overview of FogROS2-PLR. FogROS2-PLR transparently proxies ROS2 communication. It sends requests through multiple network interfaces (such as 5G and Wi-Fi) to replicated Cloud VMs. It uses the first response back to the robot.
  • Figure 4: Workflow Diagram of FogROS2-PLR. On setup (black circle): (1) The FogROS2-PLR proxy instantiates one thread per interface (5G, Wi-Fi, and Cloud Network Interface Card (NIC), in green) per service to handle communication; (2) The thread communicates with a centralized connectivity server, which runs the STUN protocol to get the public IP address of the given interface; (3) The thread advertises the service-address binding to a centralized discovery service, which lets the cloud and robot discover each other. To handle an incoming request from the robot (white circle): (1) The ROS2 application sends a request to the FogROS2-PLR proxy; (2) The proxy retrieves a unique request ID from the ROS2 Data Distribution Service (DDS) and the raw request in bytes; (3) The request goes through the optimizer, which determines the mapping of network interfaces to cloud servers; (4) The forwarder registers the unique request ID from rmw with a callback for the response and a callback for timeout; (5) The forwarder replicates the request and sends it to the servers specified by the optimizer mapping; (6) The message is routed over the failure-independent networks from the robot to the cloud; (7) The cloud-side FogROS2-PLR proxy converts the request to a regular ROS2 service request; (8) The cloud proxy invokes the cloud ROS2 service and gets the response; (9) The response is forwarded back to the robot; (10) The robot proxy uses the first response received from the replicated network interfaces and servers, and returns it to the application as a regular ROS2 service response. It can optionally invoke a timeout callback if the response is not returned promptly.
  • Figure 5: Case Study (A) High Mobility Cloud Operation with FogROS2-PLR. (A) 50-mile driving route from Sunnyvale, CA to Berkeley, CA. (B) In the mobility test, the car experiences a coverage blind spot of 5G service providers and 5G handover, in which one base station passes connectivity to the next as the car moves. Both coverage gaps and handover lead to QoS degradation. FogROS2-PLR prevents such interruptions by using an independent 5G service provider. (C) While relying on a single service provider leads to QoS fluctuations, FogROS2-PLR demonstrates smooth connectivity by using multiple 5G networks. FogROS2-PLR improves P99 latency over AT&T and Verizon by 3.7x and 2.4x, respectively; it improves mean latency by 2.7x and 1.9x, respectively.
  • ...and 2 more figures
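
The "replicate the request, use the first response, optionally time out" behavior described in the Figure 3 and Figure 4 captions can be sketched in a few lines of asyncio. This is only an illustration of the pattern under simplifying assumptions, not FogROS2-PLR's implementation: the real system proxies ROS2 service calls across network interfaces, whereas here each path is a hypothetical async callable, and `first_response` and `make_fake_path` are names invented for this example.

```python
import asyncio


async def first_response(request, senders, timeout_s):
    """Send `request` over every path and return the first successful reply.

    Returns None if no path answers within `timeout_s` seconds
    (a caller could invoke a timeout callback in that case).
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    pending = {asyncio.ensure_future(send(request)) for send in senders}
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # deadline passed with no successful reply
            done, pending = await asyncio.wait(
                pending, timeout=remaining,
                return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return task.result()  # first good response wins
                # A failed path is ignored; keep waiting on the others.
        return None
    finally:
        for task in pending:
            task.cancel()  # drop the slower replicas


def make_fake_path(name, delay_s):
    """Hypothetical stand-in for one interface-to-server path."""
    async def send(request):
        await asyncio.sleep(delay_s)
        return (name, request)
    return send


if __name__ == "__main__":
    paths = [make_fake_path("5g", 0.05), make_fake_path("wifi", 0.01)]
    print(asyncio.run(first_response(b"ping", paths, timeout_s=1.0)))
```

Cancelling the slower replicas once a reply arrives keeps the sketch tidy; whether a real deployment cancels or simply discards late responses is a design choice the captions do not specify.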