A model of actors and grey failures

Laura Bocchi; Julien Lange; Simon Thompson; A. Laura Voinea

A model of actors and grey failures

Laura Bocchi, Julien Lange, Simon Thompson, A. Laura Voinea

TL;DR

The paper tackles the challenge of grey failures in distributed actor systems by proposing a two-layer formal framework: an asynchronous, time-aware actor calculus and a separate failure layer that injects grey failures with temporal patterns. It introduces weak barbed bisimulation as the core behavioural equivalence to compare cursed and uncursed systems, and defines resilience, recoverability, and augmentation to reason about fault tolerance. Key contributions include extending the conference model with checkpointing, refining n-recoverability, proving resilience equivalence to 0-recoverability, and developing scoped barbs for practical analysis of patterns like circuit breakers. The work provides a foundation for verifying and synthesising fault-tolerant actor-based systems, with prospective tool support (e.g., UPPAAL encoding) and applications in industry patterns and testing frameworks.

Abstract

Existing models for the analysis of concurrent processes tend to focus on fail-stop failures, where processes are either working or permanently stopped, and their state (working/stopped) is known. In fact, systems are often affected by grey failures: failures that are latent, possibly transient, and may affect the system in subtle ways that later lead to major issues (such as crashes, limited availability, overload). We introduce a model of actor-based systems with grey failures, based on two interlinked layers: an actor model, given as an asynchronous process calculus with discrete time, and a failure model that represents failure patterns to inject in the system. Our failure model captures not only fail-stop node and link failures, but also grey failures (e.g., partial, transient). We give a behavioural equivalence relation based on weak barbed bisimulation to compare systems on the basis of their ability to recover from failures, and on this basis we define some desirable properties of reliable systems. By doing so, we reduce the problem of checking reliability properties of systems to the problem of checking bisimulation.

A model of actors and grey failures

TL;DR

Abstract

Paper Structure (24 sections, 10 theorems, 42 equations, 3 figures)

This paper contains 24 sections, 10 theorems, 42 equations, 3 figures.

Introduction
Synopsis
Extensions with respect to the conference paper
Informal overview
Mailboxes.
Assumptions.
A model of failures
Calculus for cursed systems
Time actions.
Failure actions.
Runtime System actions.
Basic properties of systems reductions
Reset vs Checkpointing Systems
Properties of cursed systems
Recoverability for reset-systems
...and 9 more sections

Key Result

Lemma 4.7

If $\mathbf{R}$ is time-coherent and $\mathbf{R} \,\xrightarrow{}\, \mathbf{R}'$ then $\mathbf{R}'$ is time-coherent.

Figures (3)

Figure 1: Syntax
Figure 2: Matching rules
Figure 3: Reduction rules

Theorems & Definitions (48)

Example 3.1: Permanent and transient failures
Remark 4.1
Definition 4.2: Cursed system
Definition 4.3: Operational semantics for cursed systems
Definition 4.4: Time of a system
Definition 4.5: Time coherence
Definition 4.6: Initial system
Lemma 4.7: Time-coherence invariant
Proposition 4.8
Definition 4.9: Non-instantaneous process
...and 38 more

A model of actors and grey failures

TL;DR

Abstract

A model of actors and grey failures

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (3)

Theorems & Definitions (48)