Finding Cross-rule Optimization Bugs in Datalog Engines
Chi Zhang, Linzhang Wang, Manuel Rigger
TL;DR
The paper tackles the challenge of detecting cross-rule optimization bugs in Datalog engines, where multiple rules are optimized together in ways that can yield incorrect results. It introduces Incremental Rule Evaluation (IRE), a black-box approach that constructs a reference program in which rules are evaluated individually and incrementally, without cross-rule optimization, and compares its results to those of the engine under test running the optimized program. Implemented as the tool Deopt, the method was evaluated on four mature engines (Soufflé, CozoDB, μZ, and DDlog) and discovered 30 bugs, 13 of them logic bugs. Deopt can detect all bugs found by queryFuzz, while queryFuzz might be unable to detect 5 of the bugs Deopt identified. The evaluation also shows that incremental test-case generation is more efficient than naive random generation, with the advantage in valid, non-empty test cases becoming substantial as test-case size grows: for test cases containing 60 rules, it produces 1.17× (DDlog) to 31.02× (Soufflé) as many. Overall, IRE provides a simple, general, and effective way to uncover optimization-related bugs in Datalog engines, with practical value for developers and researchers alike.
Abstract
Datalog is a widely used declarative logic programming language. Datalog engines apply many cross-rule optimizations; bugs in them can cause incorrect results. To detect such optimization bugs, we propose an automated testing approach called Incremental Rule Evaluation (IRE), which synergistically tackles the test oracle and test case generation problems. The core idea behind the test oracle is to compare the results of an optimized program and a program without cross-rule optimization; any difference indicates a bug in the Datalog engine. Our core insight is that, for an optimized, incrementally generated Datalog program, we can evaluate all rules individually by constructing a reference program to disable the optimizations that are performed among multiple rules. Incrementally generating test cases not only allows us to apply the test oracle for every newly generated rule; it also lets us ensure that every newly added rule generates a non-empty result with a given probability and avoid recomputing already-known facts. We implemented IRE as a tool named Deopt and evaluated it on four mature Datalog engines, namely Soufflé, CozoDB, μZ, and DDlog, discovering a total of 30 bugs. Of these, 13 were logic bugs, while the remaining were crash and error bugs. Deopt can detect all bugs found by queryFuzz, a state-of-the-art approach. Of the bugs identified by Deopt, queryFuzz might be unable to detect 5. Our incremental test case generation approach is efficient; for example, for test cases containing 60 rules, our incremental approach can produce 1.17× (for DDlog) to 31.02× (for Soufflé) as many valid test cases with non-empty results as the naive random method. We believe that the simplicity and the generality of the approach will lead to its wide adoption in practice.
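
To make the oracle concrete, the following is a minimal Python sketch of one plausible reading of the idea, not Deopt's actual implementation. All names (`eval_rule`, `whole_program_engine`, `buggy_engine`, `incremental_check`) are hypothetical. The sketch assumes positive rules, that each new rule defines a fresh relation over already-known ones, and it uses a toy in-memory evaluator as a stand-in for the engine under test. The reference program for each step contains only the new rule, with all previously derived facts supplied as input facts, so no optimization can span multiple rules.

```python
from itertools import product

# A rule is (head, body); an atom is (relation_name, tuple_of_variable_names).

def eval_rule(rule, facts):
    """Evaluate one positive rule against the given facts (naive join)."""
    (_, head_vars), body = rule
    derived = set()
    # Try every combination of one known tuple per body atom.
    for combo in product(*(facts.get(rel, set()) for rel, _ in body)):
        binding, ok = {}, True
        for (_, atom_vars), tup in zip(body, combo):
            for var, val in zip(atom_vars, tup):
                if binding.setdefault(var, val) != val:
                    ok = False  # same variable bound to two different values
        if ok:
            derived.add(tuple(binding[v] for v in head_vars))
    return derived

def whole_program_engine(rules, input_facts):
    """Toy stand-in for the engine under test: all rules, run to a fixpoint."""
    facts = {rel: set(tups) for rel, tups in input_facts.items()}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            head_rel = rule[0][0]
            new = eval_rule(rule, facts) - facts.get(head_rel, set())
            if new:
                facts.setdefault(head_rel, set()).update(new)
                changed = True
    return facts

def buggy_engine(rules, input_facts):
    """Same engine with an injected fault that only fires on multi-rule programs."""
    facts = whole_program_engine(rules, input_facts)
    if len(rules) > 1:
        last_head = rules[-1][0][0]
        facts[last_head] = set(sorted(facts.get(last_head, set()))[1:])  # drop a tuple
    return facts

def incremental_check(rules, input_facts, engine):
    """Add rules one at a time; return the first rule whose results disagree."""
    known = {rel: set(tups) for rel, tups in input_facts.items()}
    program = []
    for rule in rules:
        head_rel = rule[0][0]
        # Reference program: only the new rule, with all known facts as input.
        expected = engine([rule], known).get(head_rel, set())
        # Program under test: every rule so far, with the original input facts.
        program.append(rule)
        actual = engine(program, input_facts).get(head_rel, set())
        if actual != expected:
            return rule  # discrepancy: potential cross-rule optimization bug
        known[head_rel] = expected  # reuse instead of recomputing later
    return None

# path(x, y) :- edge(x, y).    two(x, z) :- path(x, y), edge(y, z).
edb = {"edge": {(1, 2), (2, 3)}}
r1 = (("path", ("x", "y")), [("edge", ("x", "y"))])
r2 = (("two", ("x", "z")), [("path", ("x", "y")), ("edge", ("y", "z"))])

print(incremental_check([r1, r2], edb, whole_program_engine))  # no discrepancy
print(incremental_check([r1, r2], edb, buggy_engine))          # reports r2
```

Because the injected fault only triggers when several rules are evaluated together, the single-rule reference program is unaffected and the comparison flags the second rule. Keeping the derived facts in `known` also mirrors the point about avoiding recomputation of already-known facts.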
