Blend: A Unified Data Discovery System

Mahdi Esmailoghli; Christoph Schnell; Renée J. Miller; Ziawasch Abedjan

TL;DR

Blend is a unified data discovery system that lets users declaratively compose keyword, join, union, and correlation discovery. It relies on a minimal, in-database index (AllTables) and a rule- and ML-based plan optimizer. The system defines atomic Seeker and Combiner operators, a lightweight discovery language, and a plan-rewriting strategy that pushes work into the database, which makes complex pipelines faster and more scalable than ad-hoc baselines. Experiments across diverse data lakes and tasks show substantial runtime improvements, reduced code complexity, and effective optimizer behavior, and a user study indicates strong practitioner demand for declarative, optimized pipelines. The authors outline future directions toward semantic discovery and embedded vector indexing.

Abstract

Most research on data discovery has so far focused on improving individual discovery operators such as join, correlation, or union discovery. However, in practice, a combination of these techniques and their corresponding indexes may be necessary to support arbitrary discovery tasks. We propose BLEND, a comprehensive data discovery system that supports existing operators and enables their flexible pipelining. BLEND is based on a set of lower-level operators that serve as fundamental building blocks for more complex and sophisticated user tasks. To reduce the execution runtime of discovery pipelines, we propose a unified index structure and a rule-based optimizer that rewrites SQL statements into low-level operators when possible. We show the superior flexibility and efficiency of our system compared to ad-hoc discovery pipelines and stand-alone solutions.
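To make the idea of a single unified index with composable low-level operators more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical schema `all_tables(cell_value, table_id, column_id, row_id)`, a hypothetical `single_column_seeker` that ranks columns by distinct-value overlap, and a hypothetical `intersect_combiner`; SQLite stands in for whatever database the system actually targets.

```python
import sqlite3


def build_index(tables):
    """Flatten {table_id: [row tuples]} into one relational index table.

    The schema below is an illustrative assumption, not the paper's layout.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE all_tables ("
        "cell_value TEXT, table_id TEXT, column_id INTEGER, row_id INTEGER)"
    )
    for table_id, rows in tables.items():
        for row_id, row in enumerate(rows):
            for column_id, value in enumerate(row):
                conn.execute(
                    "INSERT INTO all_tables VALUES (?, ?, ?, ?)",
                    (str(value), table_id, column_id, row_id),
                )
    conn.execute("CREATE INDEX idx_value ON all_tables (cell_value)")
    return conn


def single_column_seeker(conn, query_values, k=10):
    """Return the top-k (table_id, column_id, overlap) by distinct-value overlap."""
    values = sorted({str(v) for v in query_values})
    if not values:
        return []
    placeholders = ",".join("?" * len(values))
    sql = (
        "SELECT table_id, column_id, COUNT(DISTINCT cell_value) AS overlap "
        "FROM all_tables "
        f"WHERE cell_value IN ({placeholders}) "
        "GROUP BY table_id, column_id "
        "ORDER BY overlap DESC, table_id, column_id "
        "LIMIT ?"
    )
    return conn.execute(sql, (*values, k)).fetchall()


def intersect_combiner(*results):
    """Keep only the table ids that appear in every seeker result."""
    id_sets = [{row[0] for row in result} for result in results]
    return set.intersection(*id_sets) if id_sets else set()


# Toy data lake used only for illustration.
lake = {
    "cities": [("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")],
    "capitals": [("Germany", "Berlin"), ("Spain", "Madrid")],
    "fruits": [("apple", "red")],
}
conn = build_index(lake)
by_city = single_column_seeker(conn, ["Berlin", "Paris", "Rome"])
by_country = single_column_seeker(conn, ["Germany", "Spain"])
both = intersect_combiner(by_city, by_country)
```

Here each seeker runs as a single SQL query against the one index table, and the combiner merges seeker outputs by table id. A real optimizer could push the intersection into the database as well; this sketch keeps it in Python for readability.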

Paper Structure (48 sections, 1 theorem, 1 equation, 6 figures, 9 tables)

Problem Statement
System Overview
Operators
Seeker operators
Single-Column ($\mathit{SC}$) Seeker
Keyword ($\mathit{KW}$) Seeker
Multi-Column ($\mathit{MC}$) Seeker
Correlation ($\mathit{C}$) Seeker
Combiner operators
Discovery Language Grammar
Index
Seeker Implementations
Defining and Optimizing Discovery Plans
Composing discovery tasks with Blend
Plan Optimization and Execution
...and 33 more sections

Key Result

Theorem 1

Given the proposed combiners and seekers, Blend's optimizer does not alter the output of the query.

Figures (6)

Figure 1: Blend's architecture.
Figure 2: The lake index is a single relational table: AllTables.
Figure 3: Multi-objective discovery plan.
Figure 4: Average runtime comparison between Blend and JOSIE for different query sizes ($k$ = $10$).
Figure 5: Lakebench experiments.
...and 1 more figures

Theorems & Definitions (2)

Theorem 1
proof
