One Size Does NOT Fit All: On the Importance of Physical Representations for Datalog Evaluation

Nick Rassau, Felix Schuhknecht

TL;DR

This paper argues that Datalog evaluation benefits from a diversity of physical representations rather than a single default. It introduces PlayLog, a flexible, single-node in-memory engine that supports a catalog of 13 combinations of access types and index structures, and uses workload signatures to automatically select representations via decision trees. Through extensive experiments on synthetic and real-world workloads, the authors show that their automatic configuration can closely match hand-tuned implementations and outperform static baselines like Soufflé, with speedups and memory trade-offs that depend on the workload. The work highlights the importance of aligning representation choices with workload characteristics and provides a practical path toward adaptive Datalog engines.

Abstract

Datalog is an increasingly popular recursive query language that is declarative by design, meaning its programs must be translated by an engine into the actual physical execution plan. When generating this plan, a central decision is how to physically represent all involved relations, an aspect in which existing Datalog engines are surprisingly restrictive and often resort to one-size-fits-all solutions. The reason for this is that the typical execution plan of a Datalog program performs not a single type of operation against the physical representations, but a mixture of operations, such as insertions, lookups, and containment checks. Further, the relevance of each operation type depends heavily on the workload characteristics, which range from familiar properties such as the size, multiplicity, and arity of the individual relations to very specific Datalog properties, such as the "interweaving" of rules when relations occur multiple times, and in particular the recursiveness of the query, which might generate new tuples on the fly during evaluation. This indicates that a variety of physical representations, each with its own strengths and weaknesses, is required to meet the specific needs of different workload situations. To evaluate this, we conduct an in-depth experimental study of the interplay between potentially suitable physical representations and seven dimensions of workload characteristics that vary across actual Datalog programs, revealing which properties actually matter. Based on these insights, we design an automatic selection mechanism that utilizes a set of decision trees to identify suitable physical representations for a given workload.
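To make the mixture of operations concrete, the following is a minimal sketch of semi-naive evaluation for a transitive-closure program, with a counter for each operation type named in the abstract. It is an illustration only and is not taken from PlayLog: the function name `transitive_closure`, the `ops` counter, and the choice of a Python `dict` and `set` as physical representations are all assumptions made for this example.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation (illustrative sketch) of the program
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    Returns the derived path relation and a count per operation type."""
    ops = {"insert": 0, "lookup": 0, "contains": 0}

    # Hypothetical representation 1: an index on edge's join column,
    # which serves the lookups issued by the recursive rule.
    edge_by_src = defaultdict(list)
    for src, dst in edges:
        edge_by_src[src].append(dst)

    # Hypothetical representation 2: the full path relation as a set,
    # which serves containment checks and insertions.
    path = set()
    delta = []
    for t in edges:  # non-recursive rule
        ops["contains"] += 1
        if t not in path:
            path.add(t)
            ops["insert"] += 1
            delta.append(t)

    # Recursive rule: only tuples derived in the previous round are joined.
    while delta:
        new_delta = []
        for x, y in delta:
            ops["lookup"] += 1
            for z in edge_by_src.get(y, ()):
                ops["contains"] += 1  # deduplicate before inserting
                if (x, z) not in path:
                    path.add((x, z))
                    ops["insert"] += 1
                    new_delta.append((x, z))
        delta = new_delta
    return path, ops
```

Even in this small program, the recursive rule generates new tuples during evaluation, and the ratio of containment checks to insertions depends on the input: a cyclic edge relation rederives existing tuples and therefore triggers more containment checks than insertions, while an acyclic chain does not.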

Paper Structure

This paper contains 20 sections, 10 figures, 3 tables, 5 algorithms.

Figures (10)

  • Figure 1: Scaling behavior of different physical representations.
  • Figure 2: Splitting the join in probe and probe-result iteration while varying the number of duplicates in S.
  • Figure 3: Bulk-loading vs body evaluation.
  • Figure 4: Impact of the key width and the schema width.
  • Figure 5: Rule interweaving with four rules, where relation S occurs in every rule and uses 1, 2, 3, or 4 physical representations.
  • ...and 5 more figures