CHORUS: Foundation Models for Unified Data Discovery and Exploration

Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu

TL;DR

This paper demonstrates that foundation models can effectively unify data discovery tasks in data lakes and warehouses. By introducing Chorus, a prompting-based framework with post-processing and anchoring, the authors tackle table-class detection, column-type annotation, and join-column prediction, achieving superior results over task-specific baselines and often rivaling human experts. The approach emphasizes zero-/few-shot prompting, cross-task information flow, and robust risk mitigations, including domain-specific anchoring. Empirical results across multiple benchmarks, contamination checks, and ablations establish the method's robustness, scalability, and potential to transform data discovery workflows in practice.

Abstract

We apply foundation models to data discovery and exploration tasks. Foundation models include large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and so the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach including generalizability to several foundation models and the impact of non-determinism on the outputs. All in all, this suggests a future direction in which disparate data management tasks can be unified under foundation models.

Paper Structure (45 sections, 6 figures, 7 tables)

This paper contains 45 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Data discovery tasks considered in this work. Given an ontology, such as DBPedia, ① we assign an overall type to the table and ② we annotate the columns with semantic types. Last, given another table, ③ we predict the join column. The user provides the data while Chorus interacts with the foundation model. Data from [Licensing:2023aa]; full prompts in Figure 3.
  • Figure 2: Chorus system architecture.
  • Figure 3: Prompts used in this paper, materialized with examples. Most prompt elements are fixed; only the data sample (highlighted in cyan) and the metadata (highlighted in orange) change for each instance.
  • Figure 4: Anchoring illustrated. The LLM hallucinates an imagined label, iucnStatus. Under the standard approach, this poisons all the upcoming tasks; the nearest-neighbor post-processing cannot recover and outputs the incorrect label animal. With anchoring, Chorus intervenes when the first error is detected. A new conversation is started and a synthesized (false) history is provided to the LLM, in which it did not make the mistake. With only clean inputs, the LLM is able to answer the next task correctly: binomial.
  • Figure 5: Determinism vs. performance. We conduct 25 runs of Chorus on the T2D table-class benchmark. Shaded bands indicate confidence intervals. Temperature is a parameter controlling the randomness of the foundation model, with zero being the most (but not completely) deterministic.
  • ...and 1 more figure
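
The anchoring idea described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `nearest_label` and `anchored_history`, the toy ontology, and the use of string similarity from Python's `difflib` as the nearest-neighbor measure are all assumptions made for illustration. The sketch only shows how a conversation history might be rewritten so that an off-ontology answer never appears in it before the next task is posed.

```python
from difflib import SequenceMatcher


def nearest_label(raw, ontology):
    """Return the ontology label most similar to `raw`.

    Assumption: plain string similarity stands in for whatever
    nearest-neighbor measure the real system uses.
    """
    return max(
        ontology,
        key=lambda label: SequenceMatcher(None, raw.lower(), label.lower()).ratio(),
    )


def anchored_history(history, ontology):
    """Build a synthesized history for a fresh conversation.

    `history` is a list of (prompt, answer) pairs from the original
    conversation. Any answer that is not a valid ontology label is
    replaced by its nearest valid label, so the new conversation
    contains only clean inputs.
    """
    valid = set(ontology)
    clean = []
    for prompt, answer in history:
        if answer not in valid:
            # Error detected: rewrite the turn as if the model had
            # answered with a valid label.
            answer = nearest_label(answer, ontology)
        clean.append((prompt, answer))
    return clean


# Hypothetical toy ontology and conversation, for illustration only.
ontology = ["animal", "binomial", "conservationStatus"]
history = [
    ("Column values: LC, EN, VU", "iucnStatus"),  # hallucinated label
    ("Column values: cat, dog, fox", "animal"),
]
clean = anchored_history(history, ontology)
```

The design point is that the correction is applied to the history rather than only to the final output: the next prompt is sent in a new conversation seeded with `clean`, so the hallucinated label cannot influence later answers.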

Theorems & Definitions (3)

  • Definition 2.1: ① Table-class detection
  • Definition 2.2: ② Column-type annotation
  • Definition 2.3: ③ Join-column prediction