Distributed Speculative Execution for Resilient Cloud Applications
Tianyu Li, Badrish Chandramouli, Philip A. Bernstein, Samuel Madden
TL;DR
This work tackles the latency bottleneck of durable execution in modern cloud applications by introducing distributed speculative execution (DSE). It presents libDSE, a general-purpose runtime that lets developers write code as if persistence were synchronous, while the runtime transparently takes persistence off the critical path and recovers by rollback when failures occur. Key contributions include a DPR-inspired speculative protocol with a deterministic, stateless coordinator, a dependency-graph-based recovery mechanism, and a practical programming model built around StateObjects, Actions, and sthreads. The authors implement four speculative services (a log, a key-value store, workflows, and an event broker) and report latency reductions of up to an order of magnitude, with manageable overhead and good scalability, across end-to-end workloads and microbenchmarks. Overall, the paper offers a viable path toward substantially lower fault-tolerance overhead in distributed cloud applications, with applicability to workflows, streaming, and distributed primitives.
Abstract
Fault-tolerance is critically important in highly-distributed modern cloud applications. Solutions such as Temporal, Azure Durable Functions, and Beldi hide fault-tolerance complexity from developers by persisting execution state and resuming seamlessly from persisted state after failure. This pattern, often called durable execution, usually forces frequent and synchronous persistence and results in hefty latency overheads. In this paper, we propose distributed speculative execution (DSE), a technique for implementing the durable execution abstraction without incurring this penalty. With DSE, developers write code assuming synchronous persistence, and a DSE runtime is responsible for transparently bypassing persistence and reactively repairing application state on failure. We present libDSE, the first DSE application framework that achieves this vision. The key tension in designing libDSE is between imposing restrictions on user programs so the framework can safely and transparently change execution behavior, and avoiding assumptions so libDSE can support more use cases. We address this with a novel programming model centered around message-passing, atomic code blocks, and lightweight threads, and show that it allows developers to build a variety of speculative services, including write-ahead logs, key-value stores, event brokers, and fault-tolerant workflows. Our evaluation shows that libDSE reduces end-to-end latency by up to an order of magnitude compared to current generations of durable execution systems with minimal run-time overhead and manageable complexity.
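To make the idea of bypassing persistence and repairing state on failure concrete, here is a minimal Python sketch. It is not the libDSE API: the class ToyStateObject and its methods (act, persist, crash, repair) are hypothetical names invented for illustration, and the sketch leaves out the coordinator and the commit protocol entirely. It shows only the core intuition: updates return before they are durable, outgoing messages carry the sender's version, and a receiver rolls back any state that depends on a version the sender lost.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStateObject:
    """Hypothetical stand-in for a speculative service (not the libDSE API)."""
    name: str
    value: int = 0
    version: int = 0             # latest in-memory (possibly speculative) version
    persisted_version: int = 0   # latest version known to be durable
    snapshots: dict = field(default_factory=lambda: {0: 0})  # version -> value
    deps: dict = field(default_factory=dict)  # version -> {(peer_name, peer_version)}

    def act(self, delta, received_from=None):
        """Apply an update in memory and return without waiting for persistence."""
        self.version += 1
        self.value += delta
        self.snapshots[self.version] = self.value
        if received_from is not None:
            # Remember which peer version this update was derived from.
            self.deps.setdefault(self.version, set()).add(received_from)
        return (self.name, self.version)  # header attached to outgoing messages

    def persist(self):
        """Stand-in for background persistence catching up off the critical path."""
        self.persisted_version = self.version

    def rollback_to(self, version):
        """Restore the snapshot for `version` and discard everything newer."""
        self.value = self.snapshots[version]
        self.version = version
        self.persisted_version = min(self.persisted_version, version)
        self.snapshots = {v: s for v, s in self.snapshots.items() if v <= version}
        self.deps = {v: d for v, d in self.deps.items() if v <= version}

    def crash(self):
        """Simulate a failure: everything after the durable version is lost."""
        self.rollback_to(self.persisted_version)

    def repair(self, peer):
        """Roll back past any version that depends on state the peer has lost."""
        bad = [v for v, d in self.deps.items()
               if any(p == peer.name and pv > peer.version for p, pv in d)]
        if bad:
            self.rollback_to(min(bad) - 1)


a = ToyStateObject("a")
b = ToyStateObject("b")
a.act(5)
a.persist()                      # a's version 1 is durable
header = a.act(10)               # a's version 2 is speculative
b.act(1, received_from=header)   # b's version 1 depends on a's version 2
a.crash()                        # a falls back to version 1
b.repair(a)                      # b discards the update derived from lost state
print(a.value, a.version, b.value, b.version)
```

In this toy, the caller of act never waits on persist, which is where the latency saving would come from; the cost is that crash and repair may undo work that was already visible in memory. A real runtime would also need to decide when results are safe to expose externally, which this sketch does not model.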
