Synthesizing Document Database Queries using Collection Abstractions
Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang
TL;DR
This work tackles the automatic synthesis of document database queries from input-output examples. It introduces Nosdaq, a domain-specific language aligned with MongoDB's aggregation pipeline, together with a novel collection abstraction that captures both the shape of the documents and the size of the collection. The central idea is to perform deduction over abstract collections so that infeasible sketches are pruned before their completions are enumerated, which keeps the search efficient despite the vast space of candidate queries. Empirically, Nosdaq solves 108 of 110 benchmarks drawn from diverse sources within a 5-minute time limit, averaging 14.2 seconds per benchmark, and outperforms baselines including EUSolver and GPT-4o in both success rate and speed. The approach advances practical query synthesis for semi-structured data and offers a scalable path toward automating data extraction tasks in real-world document stores.
Abstract
Document databases are increasingly popular in various applications, but their queries are challenging to write because of the flexible and complex data model that underlies them. This paper presents a synthesis technique that automatically generates document database queries from input-output examples. A new domain-specific language is designed to express a representative set of document database queries in an algebraic style. The synthesis technique further leverages a novel abstraction of collections for deduction, which efficiently prunes the search space and quickly leads to the target query. An evaluation on 110 benchmarks from various sources shows that the proposed technique successfully synthesizes queries for 108 of them. On average, the synthesizer generates a document database query from a small number of input-output examples within tens of seconds.
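To make the pruning idea concrete, the following is a minimal Python sketch of deduction over an abstraction that records document shape and collection size. It is an illustration under stated assumptions, not the paper's actual abstraction or algorithm: the names `AbsColl`, `alpha`, `abs_match`, `abs_project`, and `feasible` are hypothetical, the two operators are only loosely modeled on the `$match` and `$project` pipeline stages, and projection is assumed to keep a subset of top-level fields without computing new ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbsColl:
    """Hypothetical abstract collection: shape bounds plus a size interval."""
    must: frozenset  # top-level fields every document is known to have
    may: frozenset   # top-level fields a document could have
    lo: int          # minimum number of documents
    hi: int          # maximum number of documents


def alpha(docs):
    """Abstract a concrete collection (a list of dicts)."""
    keys = [frozenset(d) for d in docs]
    must = frozenset.intersection(*keys) if keys else frozenset()
    may = frozenset().union(*keys)
    return AbsColl(must, may, len(docs), len(docs))


def abs_match(c):
    # A filter with an unknown predicate keeps the shape but may drop documents.
    return AbsColl(c.must, c.may, 0, c.hi)


def abs_project(c):
    # A projection with an unknown field list keeps the size but may drop fields
    # (assumption: it never introduces new fields).
    return AbsColl(frozenset(), c.may, c.lo, c.hi)


TRANSFORMERS = {"match": abs_match, "project": abs_project}


def feasible(sketch, inp, out):
    """Return False when no completion of the sketch can produce `out`."""
    a = alpha(inp)
    for op in sketch:
        a = TRANSFORMERS[op](a)
    n = len(out)
    if not (a.lo <= n <= a.hi):
        return False          # size mismatch: prune
    if n == 0:
        return True           # an empty output carries no shape information
    o = alpha(out)
    return a.must <= o.must and o.may <= a.may  # shape check


# One input-output example: keep some people, keep only their names.
inp = [
    {"name": "Ann", "age": 31, "city": "Oslo"},
    {"name": "Bob", "age": 17, "city": "Rome"},
    {"name": "Cy", "age": 45, "city": "Lima"},
]
out = [{"name": "Ann"}, {"name": "Cy"}]

sketches = [["project"], ["match"], ["match", "project"]]
survivors = [s for s in sketches if feasible(s, inp, out)]
```

In this toy setting the projection-only sketch is rejected by the size check (three documents cannot become two), the filter-only sketch is rejected by the shape check (the `age` and `city` fields cannot disappear), and only the filter-then-project sketch is left for the synthesizer to complete by enumeration.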
