Efficient Learned Query Execution over Text and Tables [Technical Report]

Matthias Urban, Carsten Binnig

TL;DR

ELEET is a novel execution engine that lets users seamlessly query and process text as a first-class citizen alongside tables. It can speed up multi-modal queries over tables and text by up to 575x without sacrificing accuracy.

Abstract

In this paper, we present ELEET, a novel execution engine that allows one to seamlessly query and process text as a first-class citizen along with tables. To enable this integration of text and tables, ELEET leverages learned multi-modal operators (MMOps) such as joins and unions that combine structured data with unstructured textual data. While large language models (LLMs) such as GPT-4 are interesting candidates to enable such learned multi-modal operations, we deliberately do not follow this trend, since it would result in high overhead at query runtime. Instead, ELEET comes with a more efficient small language model (SLM) that is targeted at extracting structured data from text. Thanks to our novel architecture and pre-training procedure, the ELEET-model enables high-accuracy extraction with low overhead. In our evaluation, we compare query execution based on ELEET to baselines leveraging LLMs such as GPT-4 and show that ELEET can speed up multi-modal queries over tables and text by up to 575x without sacrificing accuracy.

Paper Structure

This paper contains 36 sections, 8 equations, 19 figures, 5 tables, and 5 algorithms.

Figures (19)

  • Figure 1: Example of a query that executes a multi-modal join between a patient table and examination reports. ELEET analyzes the texts and extracts values for each queried attribute, such as the diagnosis from each examination report.
  • Figure 2: Overview of ELEET. In an offline phase, the ELEET-model can be fine-tuned for unseen domains ①. Fine-tuning the ELEET-model for an unseen domain is a one-time effort and requires only a small sample of labeled texts. ② For query execution, ELEET uses multi-modal query plans that contain traditional (white) and multi-modal database operators (purple). To compute the result of a multi-modal operation such as a join over texts, the ELEET-model is used (see ⓐ to ⓒ): During the execution of a multi-modal operation, the ELEET-model first computes embeddings of the query attributes, texts, and table input ⓐ, using its encoder ⓑ. Afterwards, the ELEET-model matches text token embeddings to query attribute embeddings to extract the output table from the text using its extractive decoder ⓒ, which decides which tokens qualify for a given query attribute.
  • Figure 3: Model architecture. After encoding a batch of sequences that each contain an input text, the latent attributes, and traditional table values ① using 12 (11+1) transformer layers ② ⓐ, all embeddings corresponding to the same cells or latent attributes are pooled, before vertical attention lets signal flow between groups of k rows ⓑ. A separate final transformer layer computes a second set of text embeddings optimized for detecting duplicates ⓒ. The decoder ③ consists of several heads for the different sub-tasks for extracting table data from texts. For instance, the row-detect head is used to find extractions in the text. For this, it pairs the embedding of each text token with the embedding of a masked cell (i.e., the attribute to be extracted) and classifies whether the token is part of the attribute or not. The tokens that are marked to be extracted are inserted into the output table ④.
  • Figure 4: Besides the join with a single-row latent table ①, there are two cases for multi-row latent tables ②. In the first case, multiple tuples need to be extracted per table row of the tabular operand (a). An interesting special case is when for each table tuple, only a single tuple needs to be extracted (b). Joins allow for several automatic optimizations, depending on the case. For example, texts not having a join partner can be skipped during extraction, which is particularly beneficial when the table has been filtered before the join. Moreover, in the case of (b), there is no need to run Algorithm \ref{alg:complex}.
  • Figure 5: The multi-modal index used for selections. When extracting the values stored in the index from the texts (using attribute-detect), we use the duplicate-detect head to identify values that refer to the same concept. This allows the index to return texts of patients that have a fever when users query for patients with high body temperature.
  • ...and 14 more figures
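
The join optimization mentioned in the Figure 4 caption (skipping texts that have no join partner) can be illustrated with a minimal sketch. This is not ELEET's implementation: `multimodal_join`, `toy_extract`, the keyword list, and the sample data are all hypothetical names and values chosen for illustration, and the keyword lookup merely stands in for the learned ELEET-model. The sketch only shows why filtering the table before the join reduces the number of extraction calls.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
# Hypothetical extractor signature: (text, queried attributes) -> extracted values.
Extractor = Callable[[str, List[str]], Row]


def multimodal_join(
    table_rows: List[Row],
    texts: Dict[str, str],
    key: str,
    attributes: List[str],
    extract: Extractor,
) -> List[Row]:
    """Join table rows with values extracted from the text sharing their key.

    Extraction is driven by the (possibly pre-filtered) table rows, so a text
    without a join partner is never passed to the extractor.
    """
    joined = []
    for row in table_rows:
        text = texts.get(row[key])
        if text is None:
            continue  # this row has no matching text
        joined.append({**row, **extract(text, attributes)})
    return joined


calls: List[str] = []  # records which texts were actually processed


def toy_extract(text: str, attributes: List[str]) -> Row:
    """Stand-in for a learned extractor: a plain keyword lookup (assumption)."""
    calls.append(text)
    known = ["fever", "fracture", "asthma"]
    found = next((d for d in known if d in text.lower()), "")
    return {attr: found for attr in attributes}


patients = [
    {"patient_id": "1", "name": "Ann"},
    {"patient_id": "2", "name": "Bob"},
    {"patient_id": "3", "name": "Cy"},
]
reports = {
    "1": "Patient presents with fever.",
    "2": "X-ray shows a fracture.",
    "3": "History of asthma.",
}

# Filter the table first; only the surviving row's report is extracted.
filtered = [p for p in patients if p["name"] == "Bob"]
result = multimodal_join(filtered, reports, "patient_id", ["diagnosis"], toy_extract)
```

In this toy setup the extractor runs once rather than three times, because two of the three reports lose their join partner after the filter. With an expensive learned extractor in place of the keyword lookup, the saving grows with the selectivity of the filter.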