Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux

TL;DR

The paper tackles learning-from-data-lakes by proposing and evaluating a Retrieve-Merge-Predict pipeline to augment a base table with joinable tables from a data lake. It introduces YADL, a semi-synthetic benchmarking data lake derived from YAGO, to enable controlled, reproducible assessment of retrieval, merging, and prediction steps across varying scales and noise levels. The study finds that simple, containment-based retrieval and tree-based models (especially CatBoost) deliver robust performance with favorable compute profiles, while complex retrieval or aggregation often yields diminishing returns relative to cost. These results offer practical guidance for automating feature engineering in data-lake contexts and highlight directions for more automated, scalable AutoML-like pipelines in heterogeneous data ecosystems.

Abstract

Machine-learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving candidate tables to join, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
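
The first two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes containment is the fraction of the base table's distinct key values that also appear in a candidate column, computes it exactly rather than with MinHash or any other retrieval method evaluated in the paper, and merges with a plain left join. The helper names (`containment`, `retrieve`, `left_join`) and the toy tables are hypothetical. The prediction step is left out; the augmented rows would simply be passed to a learner.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Table = List[Row]


def containment(query: set, candidate: set) -> float:
    """Fraction of the query's distinct values also found in the candidate (assumed definition)."""
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def retrieve(base: Table, key: str, lake: Dict[str, Table], top_k: int = 1) -> List[Tuple[float, str, str]]:
    """Rank every (table, column) pair in the lake by containment of the base key column."""
    query = {row[key] for row in base}
    scored = []
    for name, table in lake.items():
        columns = list(table[0].keys()) if table else []
        for col in columns:
            values = {row[col] for row in table}
            scored.append((containment(query, values), name, col))
    # Stable sort on the score only, highest containment first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def left_join(base: Table, key: str, table: Table, col: str) -> Table:
    """Left join: keep every base row, take the first match, fill misses with None."""
    lookup: Dict[object, Row] = {}
    for row in table:
        lookup.setdefault(row[col], row)
    extra_cols = [c for c in (table[0].keys() if table else []) if c != col]
    joined = []
    for row in base:
        match = lookup.get(row[key])
        new_row = dict(row)
        for c in extra_cols:
            new_row[c] = match[c] if match is not None else None
        joined.append(new_row)
    return joined


# Toy base table and data lake (hypothetical data, for illustration only).
base = [
    {"city": "Paris", "target": 1},
    {"city": "Rome", "target": 0},
    {"city": "Oslo", "target": 1},
]
lake = {
    "population": [{"name": "Paris", "pop": 2.1}, {"name": "Rome", "pop": 2.8}],
    "weather": [{"town": "Lima", "temp": 19}],
}

best = retrieve(base, "city", lake)[0]             # best-scoring (score, table, column)
augmented = left_join(base, "city", lake[best[1]], best[2])
```

In this toy lake, the `name` column of `population` covers two of the three base cities, so it ranks first, and the row for Oslo receives a missing value after the join. A left join keeps the base table's row count unchanged, which matters because each base row carries a prediction target.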

Paper Structure (59 sections, 2 equations, 21 figures, 10 tables, 1 algorithm)

Figures (21)

  • Figure 1: The evaluation pipeline. Given a base table, the three main steps (Retrieve, Merge, Predict) augment it with the information from the lake to improve the prediction performance. The data preparation step can be executed offline, and the resulting index may be reused across different data usage instances.
  • Figure 2: Pareto diagram for the pipeline steps. The prediction performance ($y$-axis) is plotted against retrieval + run time (top) and peak RAM usage (bottom). Each row presents the same results, broken down by retrieval method (left), join selector method (center) and predictor (right). Each dot represents the prediction performance and resource cost of a specific configuration, averaged across base tables and data lakes (e.g., the leftmost dot in the first row is obtained by using MinHash, Highest Containment Join and CatBoost). Time for offline retrieval preparation for MinHash and Starmie is not reported here.
  • Figure 3: Retrieval: Better containment improves prediction performance. Regression plot relating the prediction performance with the Jaccard containment; each dot represents an experimental run for each base table.
  • Figure 4: Trade-off in prediction performance as the number of candidates increases for the Full Join selector. Results are averaged over all base tables and data lakes and represented as a Pareto plot.
  • Figure 5: Aggregation: Pareto diagram. DFS outperforms other methods, but is much slower.
  • ...and 16 more figures