Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms
James O'Keeffe
TL;DR
This work addresses the neglect of gradual hardware degradation in swarm fault tolerance by introducing a predictive maintenance framework for robot swarms. It combines degradation modelling with $d_l$, $d_r$, and $d_S$, an immune-inspired fault-detection algorithm operating on behavioural repertoires, and a comparative evaluation of predictive ($T_P$) versus reactive ($T_R$) fault resolution in GPF and LPF swarms using ROS 2 and Gazebo. The results indicate that predictive fault tolerance achieves competitive or superior performance in most scenarios and allows faulty robots to be repaired or replaced at the base, preserving hardware resources. Overall, the study demonstrates that resolving faults within the optimal degradation window, while robots can still return safely to base, is critical for maintaining swarm autonomy. It offers practical insights for long-duration multi-robot deployments and guides future improvements in detection reliability and online fault-resolution planning.
Abstract
An active approach to fault tolerance is essential for robot swarms to achieve long-term autonomy. Previous efforts have focused on responding to spontaneous electro-mechanical faults and failures. However, many faults develop gradually over time. Waiting until such faults have manifested as failures before addressing them is both inefficient and unsustainable in a variety of scenarios. This work argues that the principles of predictive maintenance, in which potential faults are resolved before they hinder the operation of the swarm, offer a promising means of achieving long-term fault tolerance. This is a novel approach to swarm fault tolerance, and it is shown to give comparable or improved performance relative to a reactive approach in almost all cases tested.
