UQE: A Query Engine for Unstructured Databases

Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans

TL;DR

A new Universal Query Engine (UQE) is proposed that directly interrogates and draws insights from unstructured data collections, and borrows techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls.

Abstract

Analytics on structured data is a mature field with many successful methods. However, most real-world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, and across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.

Paper Structure

This paper contains 25 sections, 1 theorem, 3 equations, 5 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

The optimal proposal distribution $p$ that minimizes the variance of the estimate in Eq. (\ref{eq:is}) is $p_i \propto f(\mathcal{T}_i, \texttt{cond})$, which achieves zero variance.
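
The proposition can be checked numerically on a toy example. The sketch below is illustrative only: the referenced equation is not reproduced in this summary, so it assumes the standard importance-sampling estimator of a sum, which draws indices $i \sim p$ and averages $f_i / p_i$. The data `f` and the helper names `importance_estimate` and `estimator_variance` are hypothetical and do not come from the paper.

```python
import random
import statistics


def importance_estimate(f, p, n_samples, rng):
    """Estimate sum(f) by drawing indices i ~ p and averaging f[i] / p[i].

    Assumed standard importance-sampling form; not taken from the paper.
    """
    idx = rng.choices(range(len(f)), weights=p, k=n_samples)
    return sum(f[i] / p[i] for i in idx) / n_samples


def estimator_variance(f, p, n_samples=8, n_trials=2000, seed=0):
    """Empirical variance of the estimator over repeated independent trials."""
    rng = random.Random(seed)
    estimates = [importance_estimate(f, p, n_samples, rng) for _ in range(n_trials)]
    return statistics.pvariance(estimates)


# Hypothetical toy data: f[i] = 1 if row i satisfies the condition, else 0.
f = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
total = sum(f)

uniform = [1 / len(f)] * len(f)     # baseline: uniform sampling over rows
optimal = [fi / total for fi in f]  # proposal with p_i proportional to f_i

var_uniform = estimator_variance(f, uniform)
var_optimal = estimator_variance(f, optimal)

print(f"variance with uniform proposal: {var_uniform:.3f}")
print(f"variance with optimal proposal: {var_optimal:.3f}")
```

Under the proportional proposal every sampled row contributes exactly `sum(f)`, so the estimate is identical across trials and the empirical variance vanishes. Constructing this proposal requires already knowing $f$ for every row, which is the quantity being estimated, so it is best read as a theoretical target that a practical sampler can only approximate.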

Figures (5)

  • Figure 1: Illustration of unstructured data analysis defined in Section \ref{sec:problem}.
  • Figure 2: Aggregation (left) vs. non-aggregation (right) queries written in UQL.
  • Figure 3: UQL compiler, in analogy to a typical C++ program compiler.
  • Figure 4: Variance of different sampling approaches for aggregation queries over 3 text datasets.
  • Figure 5: Recall (moving average with window size 16) against the number of iterations on (from left to right) AirDialog with condition {cancel, no_flight} and Clevr with {obj_count < 4, #spheres > 3}. Colored lines and shades denote median and interquartile ranges across 8 independent queries and gray lines denote individual queries. The gray dashed lines denote the fraction of the positive population in the entire dataset.

Theorems & Definitions (1)

  • Proposition 1