A Logic for Repair and State Recovery in Byzantine Fault-tolerant Multi-agent Systems
Hans van Ditmarsch, Krisztina Fruzsa, Roman Kuznets, Ulrich Schmid
TL;DR
This work develops a dynamic epistemic framework for byzantine fault-tolerant multi-agent systems, enabling reasoning about agent correctness and enabling repairs and state recovery. It introduces a knowledge-plus-hope language $K_i$, $H_i$ with three update paradigms: public, private, and factual-change, each with complete axiomatizations and reduction mechanisms that preserve expressivity. The framework supports modeling fault-detection, isolation, and recovery (FDIR) techniques, including centralized and distributed corrections, self-diagnosis, and state restoration, and provides formal tools to verify update specifications and invariants. By enabling formal verification of FDIR protocols and integrating model updates with state and knowledge dynamics, the approach offers a principled foundation for reliable, mission-critical distributed systems and paves the way for embedding into runs-and-systems formalisms.
Abstract
We provide an epistemic logical language and semantics for the modeling and analysis of byzantine fault-tolerant multi-agent systems. This not only facilitates reasoning about the agents' fault status but also supports model updates for implementing repair and state recovery. For each agent, besides the standard knowledge modality our logic provides an additional modality called hope, which is capable of expressing that the agent is correct (not faulty), and also dynamic modalities enabling change of the agents' correctness status. These dynamic modalities are interpreted as model updates that come in three flavours: fully public, more private, or involving factual change. We provide complete axiomatizations for all these variants in the form of reduction systems: formulas with dynamic modalities are equivalent to formulas without. Therefore, they have the same expressivity as the logic of knowledge and hope. Multiple examples are provided to demonstrate the utility and flexibility of our logic for modeling a wide range of repair and state recovery techniques that have been implemented in the context of fault-detection, isolation, and recovery (FDIR) approaches in fault-tolerant distributed computing with byzantine agents.
