Failure Transparency in Stateful Dataflow Systems (Technical Report)

Aleksey Veresov; Jonas Spenger; Paris Carbone; Philipp Haller

Failure Transparency in Stateful Dataflow Systems (Technical Report)

Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller

TL;DR

This paper provides a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations, and provides a formal proof of failure transparency for the implementation model; i.e., it is proved that the failure-free model correctly abstracts from the failure-related details of the implementation model.

Abstract

Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.

Failure Transparency in Stateful Dataflow Systems (Technical Report)

TL;DR

Abstract

Paper Structure (32 sections, 10 theorems, 38 equations, 12 figures)

This paper contains 32 sections, 10 theorems, 38 equations, 12 figures.

Introduction
Background
Failures in Distributed Systems
Stateful Dataflow and Apache Flink
Asynchronous Barrier Snapshotting
Basic Notation
Stateful Dataflow
A Taste of Programming in Stateful Dataflow
Failure Recovery via Asynchronous Barrier Snapshotting
The ABS Protocol.
Implementation Model
Streaming Model
Stateful Dataflow Model
Derivation Rules
Assumptions
...and 17 more sections

Key Result

Lemma 7

If $R$ is observationally explainable by $R'$ w.r.t. $O$, $O'$, $T$, then it is also monotonically observationally explainable:

Figures (12)

Figure 1: This work in the context of a fully verified stack for distributed programming.
Figure 2: Example stateful dataflow program calculating the incremental average of a data stream of integers. Another stream is used to transfer control messages resetting the state of the program.
Figure 3: Examples of snapshots obtained in a distributed stateful dataflow system with three processes $p_1 \rightarrow p_2 \rightarrow p_3$.
Figure 4: An execution with failures and its observed execution.
Figure 5: Streaming syntax.
...and 7 more figures

Theorems & Definitions (37)

Definition 0: Action Application
Definition 0: Greatest Common Epoch Number
Definition 0: Output Messages
Definition 1: Messages on a Stream
Definition 1: Latest Common Snapshot
Definition 2: Execution Step
Definition 3: Executions
Definition 4: Observability Function
Definition 5: Observational Explanation
Definition 6: Observational Explainability
...and 27 more

Failure Transparency in Stateful Dataflow Systems (Technical Report)

TL;DR

Abstract

Failure Transparency in Stateful Dataflow Systems (Technical Report)

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (12)

Theorems & Definitions (37)