Table of Contents
Fetching ...

Failure Transparency in Stateful Dataflow Systems (Technical Report)

Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller

TL;DR

This paper provides a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations, and provides a formal proof of failure transparency for the implementation model; i.e., it is proved that the failure-free model correctly abstracts from the failure-related details of the implementation model.

Abstract

Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.

Failure Transparency in Stateful Dataflow Systems (Technical Report)

TL;DR

This paper provides a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations, and provides a formal proof of failure transparency for the implementation model; i.e., it is proved that the failure-free model correctly abstracts from the failure-related details of the implementation model.

Abstract

Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.
Paper Structure (32 sections, 10 theorems, 38 equations, 12 figures)

This paper contains 32 sections, 10 theorems, 38 equations, 12 figures.

Key Result

Lemma 7

If $R$ is observationally explainable by $R'$ w.r.t. $O$, $O'$, $T$, then it is also monotonically observationally explainable:

Figures (12)

  • Figure 1: This work in the context of a fully verified stack for distributed programming.
  • Figure 2: Example stateful dataflow program calculating the incremental average of a data stream of integers. Another stream is used to transfer control messages resetting the state of the program.
  • Figure 3: Examples of snapshots obtained in a distributed stateful dataflow system with three processes $p_1 \rightarrow p_2 \rightarrow p_3$.
  • Figure 4: An execution with failures and its observed execution.
  • Figure 5: Streaming syntax.
  • ...and 7 more figures

Theorems & Definitions (37)

  • Definition 0: Action Application
  • Definition 0: Greatest Common Epoch Number
  • Definition 0: Output Messages
  • Definition 1: Messages on a Stream
  • Definition 1: Latest Common Snapshot
  • Definition 2: Execution Step
  • Definition 3: Executions
  • Definition 4: Observability Function
  • Definition 5: Observational Explanation
  • Definition 6: Observational Explainability
  • ...and 27 more