Blend: A Unified Data Discovery System

Mahdi Esmailoghli, Christoph Schnell, Renée J. Miller, Ziawasch Abedjan

TL;DR

Blend is a unified data discovery system that lets users declaratively compose keyword, join, union, and correlation discovery over a single minimal, in-database index (AllTables), with a rule- and ML-based plan optimizer. It defines atomic Seeker and Combiner operators, a lightweight discovery language, and a plan-rewriting strategy that pushes work into the database, making complex pipelines faster and more scalable than ad-hoc baselines. The approach is evaluated across diverse data lakes and tasks, showing substantial runtime improvements, reduced code complexity, and effective optimization behavior; a user study indicates strong practitioner demand for such declarative, optimized pipelines. The work targets faster data discovery in large-scale data lakes and outlines future directions toward semantic discovery and embedded vector indexing.

Abstract

Most research on data discovery has so far focused on improving individual discovery operators such as join, correlation, or union discovery. However, in practice, a combination of these techniques and their corresponding indexes may be necessary to support arbitrary discovery tasks. We propose BLEND, a comprehensive data discovery system that supports existing operators and enables their flexible pipelining. BLEND is based on a set of lower-level operators that serve as fundamental building blocks for more complex and sophisticated user tasks. To reduce the execution runtime of discovery pipelines, we propose a unified index structure and a rule-based optimizer that rewrites SQL statements into low-level operators when possible. We show the superior flexibility and efficiency of our system compared to ad-hoc discovery pipelines and stand-alone solutions.
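
To make the idea of a single relational index queried by low-level operators more concrete, the following is a minimal sketch and not Blend's actual implementation. It assumes an `AllTables` layout with the columns `cell_value`, `table_id`, `column_id`, and `row_id` (illustrative names not taken from this page), uses SQLite as a stand-in database, and introduces the hypothetical helpers `build_index`, `join_seeker`, `keyword_seeker`, and `intersect_combiner`.

```python
import sqlite3


def build_index(tables):
    """Flatten every cell of every lake table into one relational table.

    The column layout is an assumption made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AllTables ("
        "cell_value TEXT, table_id TEXT, column_id INTEGER, row_id INTEGER)"
    )
    for table_id, rows in tables.items():
        for row_id, row in enumerate(rows):
            for column_id, value in enumerate(row):
                conn.execute(
                    "INSERT INTO AllTables VALUES (?, ?, ?, ?)",
                    (str(value).lower(), table_id, column_id, row_id),
                )
    conn.execute("CREATE INDEX idx_value ON AllTables (cell_value)")
    return conn


def join_seeker(conn, query_values, k=10):
    """Rank (table, column) pairs by how many distinct query values they hold."""
    values = sorted({str(v).lower() for v in query_values})
    placeholders = ",".join("?" * len(values))
    sql = (
        "SELECT table_id, column_id, COUNT(DISTINCT cell_value) AS overlap "
        f"FROM AllTables WHERE cell_value IN ({placeholders}) "
        "GROUP BY table_id, column_id "
        "ORDER BY overlap DESC, table_id, column_id LIMIT ?"
    )
    return conn.execute(sql, (*values, k)).fetchall()


def keyword_seeker(conn, keywords, k=10):
    """Rank tables by how many distinct keywords appear anywhere in them."""
    values = sorted({str(v).lower() for v in keywords})
    placeholders = ",".join("?" * len(values))
    sql = (
        "SELECT table_id, COUNT(DISTINCT cell_value) AS hits "
        f"FROM AllTables WHERE cell_value IN ({placeholders}) "
        "GROUP BY table_id ORDER BY hits DESC, table_id LIMIT ?"
    )
    return conn.execute(sql, (*values, k)).fetchall()


def intersect_combiner(*results):
    """Keep only the table ids returned by every seeker result."""
    id_sets = [{row[0] for row in result} for result in results]
    return sorted(set.intersection(*id_sets))


# Toy data lake: table id -> list of rows.
lake = {
    "cities": [("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")],
    "capitals": [("Germany", "Berlin", 3.6), ("Spain", "Madrid", 3.2)],
    "fruits": [("apple", "red"), ("banana", "yellow")],
}
conn = build_index(lake)
joinable = join_seeker(conn, ["Berlin", "Paris", "Madrid"])
mentions = keyword_seeker(conn, ["France"])
print(intersect_combiner(joinable, mentions))
```

In this sketch both seekers are plain SQL aggregations over the same table, so the database executes the discovery work, and the combiner only merges the small result sets.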

Paper Structure

This paper contains 48 sections, 1 theorem, 1 equation, 6 figures, 9 tables.

Key Result

Theorem 1

Given the proposed combiners and seekers, Blend's optimizer does not alter the output of the query.

Figures (6)

  • Figure 1: Blend's architecture.
  • Figure 2: The lake index is a single relational table: AllTables.
  • Figure 3: Multi-objective discovery plan.
  • Figure 4: Average runtime comparison between Blend and JOSIE for different query sizes ($k$ = $10$).
  • Figure 5: Lakebench experiments.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1
  • proof