Efficient Fault Tolerance for Pipelined Query Engines via Write-ahead Lineage

Ziheng Wang, Alex Aiken

TL;DR

This work presents write-ahead lineage, a novel fault recovery technique that combines Spark's lineage-based replay and write-ahead logging, and implements it in a distributed pipelined query engine called Quokka.

Abstract

Modern distributed pipelined query engines either do not support intra-query fault tolerance or employ high-overhead approaches such as persisting intermediate outputs or checkpointing state. In this work, we present write-ahead lineage, a novel fault recovery technique that combines Spark's lineage-based replay and write-ahead logging. Unlike Spark, where the lineage is determined before query execution, write-ahead lineage persistently logs lineage at runtime to support dynamic task dependencies in pipelined query engines. Since only KB-sized lineages are persisted instead of MB-sized intermediate outputs, the normal execution overhead is minimal compared to spooling- or checkpointing-based approaches. To ensure fast fault recovery times, tasks only consume intermediate outputs with persisted lineage, preventing global rollbacks upon failure. In addition, lost tasks from different stages can be recovered in a pipelined-parallel manner. We implement write-ahead lineage in a distributed pipelined query engine called Quokka. We show that Quokka is around 2x faster than SparkSQL on the TPC-H benchmark with similar fault recovery performance.
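
The core idea in the abstract (log small lineage records at runtime, only let consumers read outputs whose lineage is already persisted, and replay from the log after a failure) can be illustrated with a toy sketch. This is not Quokka's implementation: the names `LineageLog` and `Channel`, the in-memory dictionary standing in for durable storage, and the stateless task function are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LineageLog:
    """Hypothetical stand-in for a durable log (a real system would persist this)."""
    records: Dict[int, Tuple[int, ...]] = field(default_factory=dict)

    def persist(self, task_id: int, batch_ids: List[int]) -> None:
        # Only small metadata is logged: which input batches the task consumed.
        self.records[task_id] = tuple(batch_ids)

    def has(self, task_id: int) -> bool:
        return task_id in self.records


class Channel:
    """Toy channel that runs tasks over re-readable input batches (assumed)."""

    def __init__(self, source: Dict[int, List[int]], log: LineageLog,
                 fn: Callable[[List[List[int]]], int]) -> None:
        self.source = source
        self.log = log
        self.fn = fn
        self.outputs: Dict[int, int] = {}  # in-memory outputs, lost on failure

    def run_task(self, task_id: int, batch_ids: List[int]) -> None:
        # The batches are chosen at runtime, so the lineage is only known now.
        output = self.fn([self.source[b] for b in batch_ids])
        # Persist the lineage before the output becomes visible to consumers.
        self.log.persist(task_id, batch_ids)
        self.outputs[task_id] = output

    def consume(self, task_id: int) -> int:
        # Consumers may only read outputs whose lineage has been persisted.
        if not self.log.has(task_id):
            raise RuntimeError("lineage not persisted; output must not be consumed")
        return self.outputs[task_id]

    def fail(self) -> None:
        # Simulate a worker failure: all in-memory outputs are lost.
        self.outputs.clear()

    def recover(self) -> None:
        # Replay each logged task with exactly the batches recorded in its lineage.
        for task_id, batch_ids in sorted(self.log.records.items()):
            self.outputs[task_id] = self.fn([self.source[b] for b in batch_ids])


def total(batches: List[List[int]]) -> int:
    """Example task function: sum every value in the given batches."""
    return sum(sum(b) for b in batches)


source = {0: [1, 2], 1: [3], 2: [4, 5, 6]}
channel = Channel(source, LineageLog(), total)
channel.run_task(0, [0, 1])  # task 0 happened to consume batches 0 and 1
channel.run_task(1, [2])     # task 1 consumed batch 2
before = (channel.consume(0), channel.consume(1))
channel.fail()
channel.recover()
assert (channel.consume(0), channel.consume(1)) == before
```

Because the log stores batch identifiers rather than the outputs themselves, each record stays small regardless of how large the intermediate data is. The sketch ignores the stateful tasks, multiple stages, and cross-worker scheduling that the paper addresses.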

Paper Structure (29 sections, 11 figures, 1 table, 2 algorithms)

Figures (11)

  • Figure 1: Pipelined Engine: An input stream of batches is processed by tasks to create an output stream, producing intermediate state variables along the way. The number and size of batches processed by each task can be dynamically determined at runtime.
  • Figure 2: Spooling: persisted data partitions are marked with green boxes. We assume the channel with tasks 0 and 1 has failed. Since task 1 depends on state variable S1, which was not persisted, the whole channel has to restart.
  • Figure 3: Spark employs data parallel recovery, where different tasks of the same stage are assigned to different workers. Quokka conducts pipelined parallel recovery, where different stages are assigned to different workers. We assume worker 2 fails and the dashed lines indicate recovery tasks.
  • Figure 4: Quokka's architecture. Note that instead of having components communicate with each other through RPC calls, all coordination is done through the GCS. The client also communicates with the cluster through the GCS.
  • Figure 5: An example fault recovery procedure when one out of three workers fails. Pink shading represents data partitions that have been generated by past tasks and stored on the TaskManager. Recovery tasks are shaded in light blue.
  • ...and 6 more figures