A generic approach for reactive stateful mitigation of application failures in distributed robotics systems deployed with Kubernetes

Florian Mirus, Frederik Pasch, Nikhil Singhal, Kay-Ulrich Scholl

TL;DR

This paper proposes a novel approach for robotic system monitoring and stateful, reactive failure mitigation for distributed robotic systems deployed using Kubernetes and the Robot Operating System, and demonstrates the effectiveness and application-agnosticism of this approach on two example applications.

Abstract

Offloading computationally expensive algorithms to the edge or even the cloud offers an attractive option to tackle the limited on-board computational and energy resources of robotic systems. In cloud-native applications deployed with the container management system Kubernetes (K8s), one key problem is ensuring resilience against various types of failures. However, complex robotic systems interacting with the physical world pose a very specific set of challenges and requirements that are not yet covered by failure mitigation approaches from the cloud-native domain. In this paper, we therefore propose a novel approach for robotic system monitoring and stateful, reactive failure mitigation for distributed robotic systems deployed using K8s and the Robot Operating System (ROS2). By employing the generic substrate of Behaviour Trees, our approach can be applied to any robotic workload and supports arbitrarily complex monitoring and failure mitigation strategies. We demonstrate the effectiveness and application-agnosticism of our approach on two example applications, namely Autonomous Mobile Robot (AMR) navigation and robotic manipulation in a simulated environment.

Paper Structure

This paper contains 21 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: High-level overview of the monitoring and failure mitigation system.
  • Figure 2: Factors and possible weights for selection of Failure Monitoring and Mitigation Strategy.
  • Figure 3: System architecture.
  • Figure 4: Example Behaviour Tree to apply basic failure mitigation for a task and an automatically restarting workload (e.g., using a Kubernetes Deployment).
  • Figure 5: Required steps for the different failure mitigation strategies and their trade-off in terms of failure mitigation time and resource usage. Depending on the provisioned fallback workload, the steps of restarting the container, workload startup and state initialization need to be executed.
  • ...and 3 more figures