Table of Contents
Fetching ...

A model of actors and grey failures

Laura Bocchi, Julien Lange, Simon Thompson, A. Laura Voinea

TL;DR

The paper tackles the challenge of grey failures in distributed actor systems by proposing a two-layer formal framework: an asynchronous, time-aware actor calculus and a separate failure layer that injects grey failures with temporal patterns. It introduces weak barbed bisimulation as the core behavioural equivalence to compare cursed and uncursed systems, and defines resilience, recoverability, and augmentation to reason about fault tolerance. Key contributions include extending the conference model with checkpointing, refining n-recoverability, proving resilience equivalence to 0-recoverability, and developing scoped barbs for practical analysis of patterns like circuit breakers. The work provides a foundation for verifying and synthesising fault-tolerant actor-based systems, with prospective tool support (e.g., UPPAAL encoding) and applications in industry patterns and testing frameworks.

Abstract

Existing models for the analysis of concurrent processes tend to focus on fail-stop failures, where processes are either working or permanently stopped, and their state (working/stopped) is known. In fact, systems are often affected by grey failures: failures that are latent, possibly transient, and may affect the system in subtle ways that later lead to major issues (such as crashes, limited availability, overload). We introduce a model of actor-based systems with grey failures, based on two interlinked layers: an actor model, given as an asynchronous process calculus with discrete time, and a failure model that represents failure patterns to inject in the system. Our failure model captures not only fail-stop node and link failures, but also grey failures (e.g., partial, transient). We give a behavioural equivalence relation based on weak barbed bisimulation to compare systems on the basis of their ability to recover from failures, and on this basis we define some desirable properties of reliable systems. By doing so, we reduce the problem of checking reliability properties of systems to the problem of checking bisimulation.

A model of actors and grey failures

TL;DR

The paper tackles the challenge of grey failures in distributed actor systems by proposing a two-layer formal framework: an asynchronous, time-aware actor calculus and a separate failure layer that injects grey failures with temporal patterns. It introduces weak barbed bisimulation as the core behavioural equivalence to compare cursed and uncursed systems, and defines resilience, recoverability, and augmentation to reason about fault tolerance. Key contributions include extending the conference model with checkpointing, refining n-recoverability, proving resilience equivalence to 0-recoverability, and developing scoped barbs for practical analysis of patterns like circuit breakers. The work provides a foundation for verifying and synthesising fault-tolerant actor-based systems, with prospective tool support (e.g., UPPAAL encoding) and applications in industry patterns and testing frameworks.

Abstract

Existing models for the analysis of concurrent processes tend to focus on fail-stop failures, where processes are either working or permanently stopped, and their state (working/stopped) is known. In fact, systems are often affected by grey failures: failures that are latent, possibly transient, and may affect the system in subtle ways that later lead to major issues (such as crashes, limited availability, overload). We introduce a model of actor-based systems with grey failures, based on two interlinked layers: an actor model, given as an asynchronous process calculus with discrete time, and a failure model that represents failure patterns to inject in the system. Our failure model captures not only fail-stop node and link failures, but also grey failures (e.g., partial, transient). We give a behavioural equivalence relation based on weak barbed bisimulation to compare systems on the basis of their ability to recover from failures, and on this basis we define some desirable properties of reliable systems. By doing so, we reduce the problem of checking reliability properties of systems to the problem of checking bisimulation.
Paper Structure (24 sections, 10 theorems, 42 equations, 3 figures)

This paper contains 24 sections, 10 theorems, 42 equations, 3 figures.

Key Result

Lemma 4.7

If $\mathbf{R}$ is time-coherent and $\mathbf{R} \,\xrightarrow{}\, \mathbf{R}'$ then $\mathbf{R}'$ is time-coherent.

Figures (3)

  • Figure 1: Syntax
  • Figure 2: Matching rules
  • Figure 3: Reduction rules

Theorems & Definitions (48)

  • Example 3.1: Permanent and transient failures
  • Remark 4.1
  • Definition 4.2: Cursed system
  • Definition 4.3: Operational semantics for cursed systems
  • Definition 4.4: Time of a system
  • Definition 4.5: Time coherence
  • Definition 4.6: Initial system
  • Lemma 4.7: Time-coherence invariant
  • Proposition 4.8
  • Definition 4.9: Non-instantaneous process
  • ...and 38 more