Novel Table Search [Technical Report]

Besat Kassaie, Renée J. Miller

TL;DR

This work introduces a concrete scoring mechanism designed to maximize syntactic novelty, proves that it satisfies the proposed properties, and shows that the associated optimization problem is NP-hard.

Abstract

Avoiding redundancy in query results has been extensively studied in relational databases and information retrieval, yet its implications for data lakes remain largely unexplored. We bridge this gap by investigating how to discover unionable tables that contribute new information for a given query table in large-scale data lakes. We formally define Novel Table Search (NTS) as the problem of finding tables that are novel with respect to a given query table and identify two desirable properties that any scoring function for NTS should satisfy. We introduce a concrete scoring mechanism designed to maximize syntactic novelty, prove that it satisfies the proposed properties, and show that the associated optimization problem is NP-hard. To address this challenge, we develop an efficient approximation technique based on penalization, i.e., Attribute-Based Novel Table Search (ANTs). We propose three additional NTS variants to achieve syntactic novelty and introduce two evaluation metrics for syntactic novelty. Through extensive experiments, we demonstrate that ANTs outperforms other methods in capturing syntactic novelty across evaluation metrics and various benchmarks, while also achieving the lowest execution time.

Paper Structure

This paper contains 44 sections, 2 theorems, 28 equations, 11 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

The Search Novelty Score (Definition df:nscore_resultset) satisfies the blatant duplicate axiom and the dilution axiom. $\square$

Figures (11)

  • Figure 1: Given a query table $Q$, $\mathsf{NTS}$ takes as input $k$ unionable tables from Step 1 ($|S| = k$) and selects the $l$ most novel tables ($|R|\!=\!l$) using a novelty-aware scoring function. If the search method lacks built-in attribute alignment, an external aligner is applied.
  • Figure 2: Motivating example. Each tuple has been augmented by a unique identifier ID to distinguish it from others.
  • Figure 3: System performance across datasets for SSNM (left) and SNM (right), $l\!\in\![2,10]$. Y-axis truncated.
  • Figure 4: Exec. time across datasets, averaged over all queries ($l\!\in\![2,10]$). Y-axis truncated.
  • Figure 5: Comparison of SNM (right column) and SSNM (left column) across the Santos, TUS, and Ugen-v2 Small datasets.
  • ...and 6 more figures
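
The selection step described in the Figure 1 caption (choose the $l$ most novel tables out of $k$ unionable candidates) can be illustrated with a small sketch. This is only a toy under stated assumptions: the tuple-overlap score `toy_novelty` and the greedy loop `select_novel` are hypothetical names introduced here, not the paper's Search Novelty Score or the ANTs penalization, and each table is modeled as a set of tuples whose attributes are assumed to be already aligned with the query table.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[str, ...]  # assumption: attributes are already aligned with the query


def toy_novelty(candidate: Set[Row], seen: Set[Row]) -> float:
    """Fraction of the candidate's tuples that do not already appear in `seen`."""
    if not candidate:
        return 0.0
    return len(candidate - seen) / len(candidate)


def select_novel(query: Set[Row], candidates: Dict[str, Set[Row]], l: int) -> List[str]:
    """Greedily pick up to `l` table names from the k candidates.

    After each pick, the chosen table's tuples count as seen, so a later
    candidate that only repeats them scores lower.
    """
    seen = set(query)
    remaining = dict(candidates)
    chosen: List[str] = []
    while remaining and len(chosen) < l:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda name: toy_novelty(remaining[name], seen))
        chosen.append(best)
        seen |= remaining.pop(best)
    return chosen


# Hypothetical data: a query table Q and k = 3 candidate tables.
query = {("a", "1"), ("b", "2")}
candidates = {
    "dup": {("a", "1"), ("b", "2")},   # exact copy of the query
    "half": {("a", "1"), ("c", "3")},  # one repeated tuple, one new tuple
    "new": {("d", "4"), ("e", "5")},   # entirely new tuples
}
result = select_novel(query, candidates, l=2)
print(result)  # ['new', 'half']
```

In this toy data the exact copy of the query scores zero and is never selected, which is the intuition behind penalizing duplicates. The paper's actual scoring function and its axioms are defined formally in the full text and may differ from this sketch.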

Theorems & Definitions (25)

  • Example 1.1
  • Definition 1
  • Example 2.1
  • Definition 2
  • Definition 3
  • Definition 4
  • Example 3.1
  • Definition 5
  • Definition 6
  • Example 3.2
  • ...and 15 more