Blend: A Unified Data Discovery System
Mahdi Esmailoghli, Christoph Schnell, Renée J. Miller, Ziawasch Abedjan
TL;DR
BLEND is a unified data discovery system that lets users declaratively compose keyword, join, union, and correlation discovery. It relies on a minimal in-database index (AllTables) and a rule- and ML-based plan optimizer. It defines atomic Seeker and Combiner operators, a lightweight discovery language, and a plan-rewriting strategy that pushes work into the database, making complex pipelines faster and more scalable than ad-hoc baselines. The approach is evaluated across diverse data lakes and tasks, showing substantial runtime improvements, reduced code complexity, and effective optimizer behavior; a user study indicates strong practitioner demand for declarative, optimized pipelines. The work aims to accelerate data discovery in large-scale data lakes and outlines future directions toward semantic discovery and embedded vector indexing.
Abstract
Most research on data discovery has so far focused on improving individual discovery operators such as join, correlation, or union discovery. However, in practice, a combination of these techniques and their corresponding indexes may be necessary to support arbitrary discovery tasks. We propose BLEND, a comprehensive data discovery system that supports existing operators and enables their flexible pipelining. BLEND is based on a set of lower-level operators that serve as fundamental building blocks for more complex and sophisticated user tasks. To reduce the execution runtime of discovery pipelines, we propose a unified index structure and a rule-based optimizer that rewrites SQL statements into low-level operators when possible. We show the superior flexibility and efficiency of our system compared to ad-hoc discovery pipelines and stand-alone solutions.
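To make the idea of composing low-level operators over one shared index concrete, the following is a minimal, illustrative sketch in Python. It is not BLEND's actual API or index layout: the function names (`build_index`, `keyword_seeker`, `join_seeker`, `intersect_combiner`), the in-memory dictionary index, and the toy tables are all assumptions made for illustration. The sketch only shows how two different discovery operators can read from the same index and how a combiner merges their results.

```python
from collections import Counter, defaultdict


def build_index(tables):
    """Hypothetical unified index: cell value -> set of (table, column) positions."""
    index = defaultdict(set)
    for table_id, columns in tables.items():
        for column_id, values in columns.items():
            for value in values:
                index[value].add((table_id, column_id))
    return index


def keyword_seeker(index, keywords):
    """Return tables that contain at least one of the given keywords."""
    return {table_id for kw in keywords for table_id, _ in index.get(kw, ())}


def join_seeker(index, query_values, min_overlap=2):
    """Return tables with a column sharing at least min_overlap distinct query values."""
    overlap = Counter()
    for value in set(query_values):
        for position in index.get(value, ()):
            overlap[position] += 1
    return {table_id for (table_id, _), n in overlap.items() if n >= min_overlap}


def intersect_combiner(*results):
    """Keep only tables returned by every input operator."""
    return set.intersection(*results)


# Toy data lake (made up for this example).
tables = {
    "cities": {"name": ["Berlin", "Boston", "Toronto"], "country": ["DE", "US", "CA"]},
    "airports": {"code": ["BER", "BOS"], "city": ["Berlin", "Boston"]},
    "teams": {"team": ["Raptors"], "city": ["Toronto"]},
}
index = build_index(tables)

# Both seekers read the same index; the combiner merges their outputs.
joinable = join_seeker(index, ["Berlin", "Boston", "Paris"])
mentions = keyword_seeker(index, ["BER", "Raptors"])
result = intersect_combiner(joinable, mentions)
print(sorted(result))
```

In this sketch the pipeline is evaluated in application code. The abstract's point is that such compositions can instead be rewritten and pushed down to operators that run over a single shared index, which this toy version does not attempt to reproduce.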
