Synthesizing Document Database Queries using Collection Abstractions

Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang

TL;DR

This work tackles automatic synthesis of document database queries from input-output examples by introducing Nosdaq, a domain-specific language aligned with MongoDB's aggregation pipeline and a novel collection abstraction that captures both the document shape and collection size. The central idea is to perform deduction on abstract collections to prune infeasible sketches before enumerating their completions, which keeps the search tractable over a very large query space. Empirically, Nosdaq solves 108 of 110 benchmarks from diverse sources within a 5-minute limit, averaging 14.2 seconds per benchmark, and outperforms baselines including EUSolver and GPT-4o in both success rate and speed. The approach advances practical query synthesis for semi-structured data and offers a scalable path toward automated data extraction in real-world document stores.
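The pruning idea can be illustrated with a toy sketch. The code below is not the paper's abstraction or its deduction rules: the class `AbstractCollection`, the four operator names, and their transfer rules are hypothetical simplifications chosen for illustration. It assumes a collection is summarized by the field names its documents may carry plus bounds on its size, and that `project` only keeps a subset of existing fields.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class AbstractCollection:
    """Hypothetical summary of a collection: possible fields and size bounds."""
    fields: Optional[FrozenSet[str]]  # None means "shape unknown"
    min_size: int
    max_size: Optional[int]           # None means "unbounded"


def abstract_of(docs: Sequence[Dict]) -> AbstractCollection:
    """Abstract a concrete collection: union of top-level keys, exact size."""
    fields = frozenset().union(*(d.keys() for d in docs))
    return AbstractCollection(fields, len(docs), len(docs))


def step(op: str, c: AbstractCollection) -> AbstractCollection:
    """Assumed transfer rules for operators whose arguments are still holes."""
    if op == "match":    # filtering keeps the shape and may drop documents
        return AbstractCollection(c.fields, 0, c.max_size)
    if op == "project":  # assumed to keep a subset of fields; size unchanged
        return c
    if op == "unwind":   # flattening an array may shrink or grow the collection
        return AbstractCollection(c.fields, 0, None)
    if op == "group":    # new, unknown shape; at most one group per document
        return AbstractCollection(None, min(1, c.min_size), c.max_size)
    raise ValueError(f"unknown operator: {op}")


def feasible(sketch: Sequence[str], docs_in: Sequence[Dict],
             docs_out: Sequence[Dict]) -> bool:
    """Return False when no completion of the sketch can produce docs_out."""
    c = abstract_of(docs_in)
    for op in sketch:
        c = step(op, c)
    out = abstract_of(docs_out)
    shape_ok = c.fields is None or out.fields <= c.fields
    size_ok = c.min_size <= out.min_size and (
        c.max_size is None or out.min_size <= c.max_size
    )
    return shape_ok and size_ok


# Illustrative input-output example: two documents in, three documents out.
docs_in = [{"name": "a", "tags": ["x", "y"]}, {"name": "b", "tags": ["z"]}]
docs_out = [
    {"name": "a", "tags": "x"},
    {"name": "a", "tags": "y"},
    {"name": "b", "tags": "z"},
]

candidates = [["match", "project"], ["group"], ["unwind", "project"]]
kept = [s for s in candidates if feasible(s, docs_in, docs_out)]
print(kept)
```

Under these assumed rules, only `["unwind", "project"]` survives: the other two sketches cannot produce three documents from two, so none of their completions needs to be enumerated.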

Abstract

Document databases are increasingly popular in various applications, but their queries are challenging to write because of the flexible and complex data model that underlies them. This paper presents a synthesis technique that automatically generates document database queries from input-output examples. A new domain-specific language is designed to express a representative set of document database queries in an algebraic style. Furthermore, the synthesis technique leverages a novel abstraction of collections for deduction to efficiently prune the search space and quickly generate the target query. An evaluation on 110 benchmarks from various sources shows that the proposed technique successfully solves 108 of them. On average, the synthesizer generates document database queries from a small number of input-output examples within tens of seconds.

Paper Structure

This paper contains 26 sections, 4 theorems, 34 equations, 15 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Let $\tilde{\mathcal{D}}$ be an abstract database over schema $\mathcal{S}$, $\Omega$ be a sketch, $\mathcal{Q}$ be a query that is a completion of $\Omega$, and $(I, O)$ be an input-output example, where $\vdash I: \mathcal{S}$ and $\vdash O : \textsl{Arr}\langle \tau_O \rangle$. If $\llbracket{\ma

Figures (15)

  • Figure 1: Schematic workflow.
  • Figure 2: Input example.
  • Figure 3: Schema of document databases.
  • Figure 4: Definition of document databases.
  • Figure 5: Rules for conformance between databases and schemas.
  • ...and 10 more figures

Theorems & Definitions (31)

  • Example 1
  • Example 2
  • Example 3
  • Definition 1: Input-output example
  • Definition 2: Abstract collection
  • Definition 3: Abstract database
  • Definition 4: Placeholder
  • Definition 5: Augmented type
  • Example 4
  • Definition 6: Abstract collection with placeholders
  • ...and 21 more