CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows

George Siachamis, Kyriakos Psarakis, Marios Fragkoulis, Arie van Deursen, Paris Carbone, Asterios Katsifodimos

TL;DR

The paper evaluates three checkpointing families for streaming dataflows on a shared testbed: Coordinated Aligned Checkpointing (COOR), Uncoordinated Checkpointing (UNC), and Communication-induced Checkpointing (CIC). Through theoretical discussion and an open-source implementation, it shows that COOR typically delivers the best throughput under uniform workloads, but UNC can outperform it under skewed input and cyclic workloads, challenging the default preference for coordination. CIC consistently incurs high messaging overhead and does not outperform the others in most scenarios, though it reduces the domino effect. The work offers guidance for protocol selection and motivates further research into uncoordinated approaches, and it provides a common testbed for reproducible experiments.

Abstract

Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the time of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated checkpoints, an extension of the original Chandy-Lamport checkpoints from 1985. However, the reasons behind this prevalence of the coordinated approach remain anecdotal, as reported by practitioners of the stream processing community. At the same time, common checkpointing approaches, such as the uncoordinated and the communication-induced ones, remain largely unexplored. This paper is the first to address this gap by i) shedding light on why practitioners have favored the coordinated approach and ii) investigating whether there are viable alternatives. To this end, we implement three checkpointing approaches that we surveyed and adapted for the distinct needs of streaming dataflows. Our analysis shows that the coordinated approach outperforms the uncoordinated and communication-induced protocols under uniformly distributed workloads. To our surprise, however, the uncoordinated approach is not only competitive with the coordinated one in uniformly distributed workloads, but it also outperforms the coordinated approach in skewed workloads. We conclude that rather than blindly employing coordinated checkpointing, research should focus on optimizing the very promising uncoordinated approach, as it can address issues with skew and support prevalent cyclic queries. We believe that our findings can trigger further research into checkpointing mechanisms.

Paper Structure (16 sections, 13 figures, 4 tables, 1 algorithm)

Figures (13)

  • Figure 1: Examples of valid recovery lines when in-flight messages are included in the global state.
  • Figure 2: Cases of inconsistent and consistent state after recovery for stateful operators $O_1, O_2$ and $O_3$.
  • Figure 3: Example execution of the coordinated aligned checkpointing protocol. Messages are represented as circles, and markers are squares. Different colors denote different coordinated rounds.
  • Figure 4: Example overview of the rollback propagation algorithm on a given execution timeline.
  • Figure 5: Domino effect of invalid checkpoints on a cyclic query.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Definition 1: At-most-once
  • Definition 2: At-least-once
  • Definition 3: Exactly-once
  • Definition 4: Orphan message
  • Definition 5: Consistent global state
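The listing above gives only the names of Definitions 4 and 5, not their text. As a minimal sketch, assuming the usual rollback-recovery reading (an orphan message is one whose receive event is covered by the receiver's checkpoint while its send event is not covered by the sender's checkpoint, and a global state is consistent when it contains no orphan messages), the check could look as follows. The names `Message`, `orphan_messages`, and `is_consistent` are hypothetical and are not taken from the paper or its testbed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message exchanged between two operators.
    sender: str
    receiver: str
    send_seq: int  # index of the send event in the sender's local event order
    recv_seq: int  # index of the receive event in the receiver's local event order


def orphan_messages(messages, checkpoint):
    """Return messages received inside the checkpoint but sent outside it.

    `checkpoint` maps each operator name to the index of the last local
    event covered by that operator's checkpoint (an assumed encoding).
    """
    return [
        m
        for m in messages
        if m.recv_seq <= checkpoint[m.receiver]
        and m.send_seq > checkpoint[m.sender]
    ]


def is_consistent(messages, checkpoint):
    """A set of checkpoints is treated as consistent if it has no orphans."""
    return not orphan_messages(messages, checkpoint)


# O1 sends m as its 2nd local event; O2 receives it as its 1st local event.
m = Message(sender="O1", receiver="O2", send_seq=2, recv_seq=1)

# O1 checkpointed before sending, O2 after receiving: m is an orphan.
inconsistent_line = {"O1": 1, "O2": 1}

# Both send and receive are covered by the checkpoints: no orphan.
consistent_line = {"O1": 2, "O2": 1}

# Send is covered, receive is not: m is in flight, which is not an orphan.
in_flight_line = {"O1": 2, "O2": 0}
```

Under this encoding, a message that is sent but not yet received at the recovery line counts as in-flight rather than orphaned, which matches the Figure 1 caption's mention of in-flight messages being included in the global state.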