Efficiently Estimating Mutual Information Between Attributes Across Tables

Aécio Santos; Flip Korn; Juliana Freire

TL;DR

This paper tackles the high cost of discovering relevant external tables for relational data augmentation by estimating the mutual information $I(X;Y)$ over left joins without materializing the full join. It introduces two sketch-based sampling methods: LV2SK, a two-level sampling baseline, and TUPSK, the proposed tuple-based approach. Each produces a fixed-size sample that stands in for the join result and can be fed to existing MI estimators. In extensive experiments on synthetic and real data, the authors show that TUPSK yields accurate MI approximations, is robust to join-key distributions and data-type mixtures, and improves the reliability of feature rankings for augmentation. The approach enables scalable, MI-driven data discovery in large data lakes and opens paths toward more principled, theory-backed data augmentation pipelines whose sketches have provable properties.
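For reference, the quantity being estimated can be written out explicitly. The textbook definition of mutual information for two discrete attributes $X$ and $Y$ is given below; the symbols $p(x,y)$, $p(x)$, and $p(y)$ denote the joint and marginal distributions over rows of the join result. This is the standard definition, not notation taken from the paper, and continuous or mixed attributes call for other estimators.

$$
I(X;Y) = \sum_{x} \sum_{y} p(x,y) \, \log \frac{p(x,y)}{p(x)\,p(y)}
$$

$I(X;Y)$ is zero when $X$ and $Y$ are independent and grows as knowing one attribute reduces uncertainty about the other.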

Abstract

Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap or containment. However, the sheer number of tables obtained from these systems results in irrelevant joins that need to be performed; this can be computationally expensive or even infeasible in practice. We address this limitation by proposing the use of efficient mutual information (MI) estimation for finding relevant joinable tables. We introduce a new sketching method that enables efficient evaluation of relationship discovery queries by estimating MI without materializing the joins and returning a smaller set of tables that are more likely to be relevant. We also demonstrate the effectiveness of our approach at approximating MI in extensive experiments using synthetic and real-world datasets.
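The idea of estimating MI from a sample of the join rather than the full join can be illustrated with a minimal Python sketch. It is not LV2SK or TUPSK: it uses a simplified stand-in in which both tables keep only the rows whose hashed join key falls below a threshold, so the same keys survive on both sides. The sample size is therefore not fixed, unlike the sketches described in the paper. All names (`unit_hash`, `plugin_mi`, `sketch_rows`, `join_pairs`), the synthetic tables, and the sampling rate are assumptions made for illustration, and the plug-in estimator applies to discrete values only.

```python
import hashlib
import math
import random
from collections import Counter


def unit_hash(key):
    """Map a key to a deterministic pseudo-random float in [0, 1]."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def plugin_mi(pairs):
    """Plug-in (empirical-frequency) MI estimate, in nats, for discrete (x, y) pairs."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # (c/n) * log( (c/n) / ((px/n) * (py/n)) ) simplifies to the expression below.
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def sketch_rows(rows, rate):
    """Keep rows whose hashed key is below `rate` (the same keys are kept in every table)."""
    return [(k, v) for k, v in rows if unit_hash(k) < rate]


def join_pairs(left_rows, right_rows):
    """Join (key, y) rows with (key, x) rows; assumes keys are unique on the right.

    Left rows without a match are skipped because they yield no (x, y) pair.
    """
    lookup = dict(right_rows)
    return [(lookup[k], y) for k, y in left_rows if k in lookup]


# Hypothetical synthetic tables: x is a function of the key, y is a noisy copy of x.
random.seed(0)
num_keys = 5000
right_table = [(k, k % 4) for k in range(num_keys)]  # (key, x)
left_table = []
for _ in range(20000):
    k = random.randrange(num_keys)
    y = k % 4 if random.random() < 0.8 else random.randrange(4)
    left_table.append((k, y))  # (key, y)

# Reference value: MI estimated on the fully materialized join.
full_pairs = join_pairs(left_table, right_table)
full_mi = plugin_mi(full_pairs)

# Approximation: join only the two small sketches.
rate = 0.1
sketch_pairs = join_pairs(sketch_rows(left_table, rate), sketch_rows(right_table, rate))
sketch_mi = plugin_mi(sketch_pairs)

print(f"full join:   {len(full_pairs)} rows, MI estimate {full_mi:.3f} nats")
print(f"sketch join: {len(sketch_pairs)} rows, MI estimate {sketch_mi:.3f} nats")
```

Because the hash is applied to the join key, the sketches of different tables can be built independently and still be joined afterwards, which is what makes it possible to avoid materializing the full join at query time.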

Paper Structure (27 sections, 7 equations, 5 figures, 2 tables)

Introduction
Background
MI Estimation For Data Augmentation
Problem Statement
Joining Arbitrary Tables
MI Estimation using Sketches
Baseline: Two-Level Sampling (LV2SK)
Proposed Approach: Tuple-based Sampling (TUPSK)
Experimental Evaluation
Synthetic Data Generation
Experiments Using Synthetic Data
True vs. Estimated MI on Full-Table Joins
Assessing Sketch Estimation Accuracy
Effect of the Join Key Distribution
Effect of Distinct Values
...and 12 more sections

Figures (5)

Figure 1: Example of relational data augmentation for the problem of taxi demand prediction. Adding new features, such as AVG[Temp] and AVG[Rainfall], derived from external tables helps predict, or explain the variance of, the NumTrips attribute. The augmented table (d) is derived by joining $\mathcal{T}_{taxi}$ and $\mathcal{T}_{weather}$ on Date, and with $\mathcal{T}_{demographics}$ on ZipCode.
Figure 2: True MI vs MI estimates computed using sketches of size $n=256$. Each plot shows a different method (LV2SK on the left and TUPSK on the right) and each line shows results for different data types/estimators and join key generation processes. TUPSK is more robust to the join key distribution.
Figure 3: True MI vs MI estimates computed using sketches of size $n=256$ for CDUnif. Each plot shows a different sketching method while each line shows results for different data types/estimators and join key generation processes.
Figure 4: Sketch MI estimate versus the true MI computed using distribution parameters. Sketch size is $n=256$ for all plots.
Figure 5: Sketch MI estimate versus the MI estimate computed using the full join output for tables from the WBF collection. Sketches are created using TUPSK with size $n=1024$ for all plots.

Theorems & Definitions (3)

Example 1: Understanding Taxi Demand
Definition
Example 2
