A New Probabilistic Mobile Byzantine Failure Model for Self-Protecting Systems
Silvia Bonomi, Giovanni Farina, Roy Friedman, Eviatar B. Procaccia, Sebastien Tixeuil
TL;DR
This work tackles the problem of predicting and managing security in self-protecting distributed systems under evolving Byzantine threats. It introduces a probabilistic Mobile Byzantine Failure (MBF) model within a MAPE-K-based architecture and develops discrete- and continuous-time Markov chain formulations (DTMC/CTMC) to analyze infection and recovery dynamics. The authors derive hitting-time and stationary-distribution results for three CTMC sub-models (External, Internal, and Coordinated) and validate them with extensive simulations, showing how infection and recovery rates shape safe operating windows and recovery strategies. The findings enable proactive configuration planning and a choice between local and global rejuvenation, offering a principled approach to maintaining Byzantine resilience in real-time systems with provable probabilistic guarantees.
Abstract
Modern distributed systems face growing security threats, as attackers continuously enhance their skills and vulnerabilities span the entire system stack, from hardware to the application layer. In the system design phase, fault tolerance techniques can be employed to safeguard systems. From a theoretical perspective, an attacker attempting to compromise a system can be abstracted by considering the presence of Byzantine processes in the system. Although this approach enhances the resilience of the distributed system, it limits how accurately the model reflects real-world scenarios. In this paper, we consider a self-protecting distributed system based on the \emph{Monitor-Analyze-Plan-Execute over a shared Knowledge} (MAPE-K) architecture, and we propose a new probabilistic Mobile Byzantine Failure (MBF) model that can be plugged into the Analysis component. Our new model captures the dynamics of evolving attacks and can be used to drive the self-protection and reconfiguration strategy. We mathematically analyze the time it takes for the number of Byzantine nodes to cross given thresholds, or for the system to self-recover back into a safe state, depending on the rate at which Byzantine infection spreads \emph{vs.} the rate of self-recovery. We also provide simulation results that illustrate the behavior of the system under such assumptions.
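To make the threshold-crossing question concrete, the following is a minimal sketch of an expected hitting time in a simple birth-death CTMC. It is an illustration under stated assumptions, not the paper's model: we assume that each of the n - k healthy nodes is infected independently at rate `infect_rate` and that each of the k Byzantine nodes recovers independently at rate `recover_rate`. The function name and all parameter names are hypothetical.

```python
def expected_time_to_threshold(n, threshold, infect_rate, recover_rate):
    """Expected time for the Byzantine count to go from 0 to `threshold`.

    Assumed (illustrative) birth-death CTMC on states k = 0..n:
      k -> k + 1 at rate infect_rate * (n - k)   (infection)
      k -> k - 1 at rate recover_rate * k        (recovery)
    """
    if not 0 < threshold <= n:
        raise ValueError("threshold must satisfy 0 < threshold <= n")
    if infect_rate <= 0 or recover_rate < 0:
        raise ValueError("infect_rate must be > 0 and recover_rate >= 0")

    total = 0.0  # expected time to reach `threshold` from state 0
    step = 0.0   # expected time to move up one state from the previous state
    for k in range(threshold):
        birth = infect_rate * (n - k)
        death = recover_rate * k
        # Standard birth-death recurrence: T_k = 1/b_k + (d_k / b_k) * T_{k-1},
        # where T_k is the expected time to go from state k to state k + 1.
        step = 1.0 / birth + (death / birth) * step
        total += step
    return total


if __name__ == "__main__":
    # Hypothetical setting: n = 10 nodes, safety violated at 4 Byzantine nodes.
    for mu in (0.0, 1.0, 5.0):
        t = expected_time_to_threshold(10, 4, infect_rate=0.1, recover_rate=mu)
        print(f"recover_rate={mu}: expected time to threshold = {t:.3f}")
```

In this toy setting, raising the recovery rate relative to the infection rate lengthens the expected time before the threshold is crossed, which is the kind of trade-off the analysis in the paper quantifies for its own sub-models.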
