FlowLog: Efficient and Extensible Datalog via Incrementality

Hangdong Zhao, Zhenghong Yu, Srinag Rao, Simon Frisk, Zhiwei Fan, Paraschos Koutris

TL;DR

FlowLog tackles the long-standing trade-off between efficiency and extensibility in Datalog systems by decoupling recursive control from per-rule logic through an explicit per-rule IR. The IR is lowered to Differential Dataflow, which enables incremental, scalable execution while still allowing IR-level optimizations such as logic fusion, subplan sharing, and robust prefiltering based on sideways information passing (SIP) for recursive workloads. Key contributions include a structural cost model with a JST-based search for join orders, plan execution tailored to bushy, parallel dataflows, and several optimization hooks (Boolean specialization, semijoin prefilters, and cross-rule sharing) that together reduce memory use and latency and improve scalability. Across a broad benchmark suite, FlowLog outperforms state-of-the-art Datalog engines and several modern databases on many recursive workloads, with strong parallel scaling and robust performance under diverse conditions.

Abstract

Datalog-based languages are regaining popularity as a powerful abstraction for expressing recursive computations in domains such as program analysis and graph processing. However, existing systems often face a trade-off between efficiency and extensibility. Engines like Soufflé achieve high efficiency through domain-specific designs, but lack general-purpose flexibility. Others, like RecStep, offer modularity by layering Datalog on traditional databases, but struggle to integrate Datalog-specific optimizations. This paper bridges this gap by presenting FlowLog, a new Datalog engine that uses an explicit per-rule relational IR to cleanly separate recursive control (e.g., semi-naive execution) from each rule's logical plan. This boundary lets us retain fine-grained, Datalog-aware optimizations at the logical layer while reusing off-the-shelf database primitives at execution. At the logical level (i.e., the IR), we apply proven SQL optimizations, such as logic fusion and subplan reuse. To address high volatility in recursive workloads, we adopt a robustness-first approach that pairs a structural optimizer (avoiding worst-case joins) with sideways information passing (early filtering). Built atop Differential Dataflow, a mature framework for streaming analytics, FlowLog supports both batch and incremental Datalog and adds a novel recursion-aware optimization called Boolean (or algebraic) specialization. Our evaluation shows that FlowLog outperforms state-of-the-art Datalog engines and modern databases across a broad range of recursive workloads, achieving superior scalability while preserving a simple and extensible architecture.
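
As a rough illustration of the "recursive control (e.g., semi-naive execution)" mentioned in the abstract, the sketch below evaluates a small reachability program in plain Python. It is not FlowLog's implementation and does not use Differential Dataflow. The function name `reach_seminaive`, the two-rule program in the docstring, and the sample edge list are all assumptions made for illustration. The point it shows is that each round joins only the newly derived facts (the delta) with the edge relation, rather than rejoining everything derived so far.

```python
def reach_seminaive(edges, source):
    """Nodes reachable from `source`, computed with semi-naive evaluation.

    Hypothetical Datalog program being sketched (not taken from the paper):
        reach(y) :- source(y).
        reach(y) :- reach(x), edge(x, y).
    """
    # Index edge(x, y) by x so each delta fact is joined via one lookup.
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)

    reach = {source}  # every fact derived so far
    delta = {source}  # facts that were new in the previous round
    rounds = 0
    while delta:
        rounds += 1
        # Join only the delta with edge, not the whole reach relation.
        derived = {y for x in delta for y in succ.get(x, ())}
        # Keep only facts not seen before; they form the next delta.
        delta = derived - reach
        reach |= delta
    return reach, rounds


# Assumed sample graph: a cycle 1 -> 2 -> 3 -> 1, a tail 3 -> 4,
# and a component 5 -> 6 that is not reachable from node 1.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)]
reachable, rounds = reach_seminaive(edges, 1)
print(sorted(reachable), rounds)
```

The loop stops at the first round that derives no new fact, which is the fixpoint. A naive evaluator would reach the same result but would rejoin the full `reach` set with `edge` in every round.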

Paper Structure

This paper contains 22 sections, 8 figures, and 1 table.

Figures (8)

  • Figure 1: System Architecture of FlowLog
  • Figure 2: Logic Fusion for $r_2$ from Example \ref{ex:reach}
  • Figure 3: A rooted JST for $r_2$ in Example \ref{ex:reach} (left) and its translated IR (right), following a post-order traversal of the rooted JST.
  • Figure 4: The doop rule for Example \ref{ex:doop} (left); the rooted JST chosen by the optimizer over the cyclic join graph (center, numbered in post-order); the corresponding IR following the post-order (right). Semijoin Reach(inm) is pushed to LoadArrayIdx; Jn is shorthand for Join-FlatMap.
  • Figure 5: Subplan sharing within the IR of Fig. \ref{fig:jst} (reused Map operators are shaded)
  • ...and 3 more figures