Universal Scalability in Declarative Program Analysis (with Choice-Based Combination Pruning)
Anastasios Antoniadis, Ilias Tsatiris, Nevill Grech, Yannis Smaragdakis
TL;DR
The paper tackles scalability in declarative program analysis by addressing the uncontrolled growth of results in Datalog-based analyses. It introduces Choice-Bound, a simple and universal technique that extensions non-deterministic choice to multiplicity bounds via a counting field and a hash-based identifier, parameterized by $N$. Applied to large frameworks such as Doop and Symbolic Value-Flow (symvalic) for Ethereum, Choice-Bound delivers substantial performance improvements—often exceeding an order of magnitude ($>10\times$)—while incurring only modest completeness loss. The approach yields a flexible design space for tuning precision vs. performance and demonstrates practical impact by enabling previously intractable analyses to run within reasonable time and resources.
Abstract
In this work, we present a simple, uniform, and elegant solution to the problem, with stunning practical effectiveness and application to virtually any Datalog-based analysis. The approach consists of leveraging the choice construct, supported natively in modern Datalog engines like Soufflé. The choice construct allows the definition of functional dependencies in a relation and has been used in the past for expressing worklist algorithms. We show a near-universal construction that allows the choice construct to flexibly limit evaluation of predicates. The technique is applicable to practically any analysis architecture imaginable, since it adaptively prunes evaluation results when a (programmer-controlled) projection of a relation exceeds a desired cardinality. We apply the technique to probably the largest, pre-existing Datalog analysis frameworks in existence: Doop (for Java bytecode) and the main client analyses from the Gigahorse framework (for Ethereum smart contracts). Without needing to understand the existing analysis logic and with minimal, local-only changes, the performance of each framework increases dramatically, by over 20x for the hardest inputs, with near-negligible sacrifice in completeness.
