Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms

James O'Keeffe

TL;DR

This work addresses the neglect of gradual hardware degradation in swarm fault tolerance by introducing a predictive maintenance framework for robot swarms. It combines degradation modelling through the parameters $d_l$, $d_r$, and $d_S$, an immune-inspired fault-detection algorithm operating on behavioural repertoires, and a comparative evaluation of predictive ($T_P$) versus reactive ($T_R$) fault resolution in GPF and LPF swarms using ROS 2 and Gazebo. The results indicate that predictive fault tolerance achieves competitive or superior performance in most scenarios and enables faulty robots to be repaired or replaced at the base, preserving hardware resources. Overall, the study shows that initiating fault resolution within the optimal degradation window, while a robot can still return safely to base, is critical for maintaining swarm autonomy. It offers practical insights for long-duration multi-robot deployments and points to future improvements in detection reliability and online fault-resolution planning.

Abstract

An active approach to fault tolerance is essential for robot swarms to achieve long-term autonomy. Previous efforts have focused on responding to spontaneous electro-mechanical faults and failures. However, many faults occur gradually over time. Waiting until such faults have manifested as failures before addressing them is both inefficient and unsustainable in a variety of scenarios. This work argues that the principles of predictive maintenance, in which potential faults are resolved before they hinder the operation of the swarm, offer a promising means of achieving long-term fault tolerance. This is a novel approach to swarm fault tolerance, which is shown to give comparable or improved performance relative to a reactive approach in almost all cases tested.

Paper Structure

This paper contains 4 sections, 7 equations, 5 figures, 4 tables, 3 algorithms.

Figures (5)

  • Figure 1: A: Experimental setup for 20 robots in the open environment. The base is highlighted in light green and resource nests are indicated by the three grey circles opposite the robots. B: Experimental setup for 20 robots in the constrained environment. C: Example of how clusters of shut-down faulty robots can impede swarm progress by obstructing operational robots from their goals. For ease of user differentiation during experiments, shut-down robots appear as dark featureless cylinders of equivalent dimensions to the lighter coloured functioning robots. D: Example of severe disruption caused by robots shut down in already constrained spaces, completely blocking access in some cases.
  • Figure 2: Plots of \ref{equ:power_wheels} and \ref{equ:velocity} (left) and \ref{equ:range_2} (right).
  • Figure 3: The median resources collected by each robot in 15 minutes of simulated time for every combination of algorithm, environment, and swarm size. A comparison is shown for predictive ($T_P^*$) and reactive ($T_R^*$) fault resolutions, displayed as filled or white bars, respectively. Fault resolutions are initiated for robots with any $d_{l,r,S} < d_0$.
  • Figure 4: The values of $\delta$ for detections of motor and sensor faults made by \ref{algAAPD} on $N = 10$ robots performing the GPF algorithm in the open environment for 15 minutes of simulated time.
  • Figure 5: The median resources collected in 15 minutes of simulated time in each combination of algorithm, environment, and swarm size. A comparison is shown for predictive ($T_P$) and reactive ($T_R$) fault resolutions initiated when a fault is detected by \ref{algAAPD}, as well as the highest performing instance of $T_R^*$ taken from the corresponding scenario in \ref{fig:PFDDRAnalysis}.