Featurized-Decomposition Join: Low-Cost Semantic Joins with Guarantees

Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran

TL;DR

This paper tackles the high cost of LLM-based semantic joins by introducing Featurized-Decomposition Join (FDJ), which builds a conjunctive normal form (CNF) expression of inexpensive, feature-based predicates to prune candidate pairs before invoking the LLM. FDJ generates features automatically via an LLM-driven pipeline and sets comparison thresholds with a theoretically grounded procedure that guarantees recall while minimizing cost. The underlying optimization problem (MCFD) is shown to be NP-hard, so the authors provide a practical two-stage solution with statistical guarantees, along with an analysis of thresholds, scaffolds, and target adjustment. Experiments on six real-world datasets show up to 10x cost reductions compared with state-of-the-art cascade methods, and performance remains robust under varying data characteristics and recall targets. Overall, FDJ offers a principled, scalable framework for efficient, reliable semantic joins in LLM-enabled data systems.

Abstract

Large Language Models (LLMs) are increasingly used within data systems to process large datasets with text fields. A broad class of such tasks involves a semantic join: joining two tables based on a natural language predicate per pair of tuples, evaluated using an LLM. Semantic joins generalize tasks such as entity matching and record categorization, as well as more complex text understanding tasks. A naive implementation is expensive, as it requires invoking an LLM for every pair of rows in the cross product. Existing approaches mitigate this cost by first applying embedding-based semantic similarity to filter candidate pairs, deferring to an LLM only when similarity scores are deemed inconclusive. However, these methods yield limited gains in practice, since semantic similarity may not reliably predict the join outcome. We propose Featurized-Decomposition Join (FDJ for short), a novel approach for performing semantic joins that significantly reduces cost while preserving quality. FDJ automatically extracts features and combines them into a logical expression in conjunctive normal form, which we call a featurized decomposition, to effectively prune out non-matching pairs. A featurized decomposition extracts key information from text records and performs inexpensive comparisons on the extracted features. We show how to use LLMs to automatically extract reliable features and compose them into logical expressions while providing statistical guarantees on the output result, an inherently challenging problem due to dependencies among features. Experiments on real-world datasets show up to a 10-fold reduction in cost compared with the state of the art while providing the same quality guarantees.
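
The pruning idea in the abstract can be illustrated with a minimal Python sketch. It assumes features have already been extracted into plain dictionaries, and every name in it (`same`, `within`, `passes`, `fdj_style_join`, the toy `llm_match` stub, and the `brand` and `year` features) is hypothetical rather than taken from the paper. It shows only how a CNF over cheap feature comparisons can filter the cross product before the expensive predicate is called; it does not reproduce the paper's feature generation or its threshold-setting procedure.

```python
from itertools import product
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Predicate = Callable[[Record, Record], bool]
CNF = List[List[Predicate]]  # AND over clauses, OR within each clause


def same(feature: str) -> Predicate:
    """Cheap predicate: the extracted feature values are equal."""
    return lambda l, r: l[feature] == r[feature]


def within(feature: str, threshold: float) -> Predicate:
    """Cheap predicate: numeric feature values differ by at most threshold."""
    return lambda l, r: abs(l[feature] - r[feature]) <= threshold


def passes(cnf: CNF, l: Record, r: Record) -> bool:
    """A pair survives pruning if every clause has at least one true predicate."""
    return all(any(p(l, r) for p in clause) for clause in cnf)


def fdj_style_join(left, right, cnf, llm_match):
    """Filter the cross product with the CNF, then call the costly predicate."""
    candidates = [(l, r) for l, r in product(left, right) if passes(cnf, l, r)]
    matches = [(l, r) for l, r in candidates if llm_match(l, r)]
    return matches, len(candidates)


# Toy data; the feature values stand in for LLM-extracted features.
left = [
    {"id": "a1", "brand": "acme", "year": 2020},
    {"id": "a2", "brand": "zen", "year": 2018},
]
right = [
    {"id": "b1", "brand": "acme", "year": 2021},
    {"id": "b2", "brand": "acme", "year": 2010},
    {"id": "b3", "brand": "zen", "year": 2018},
]

# Hypothetical decomposition: (same brand) AND (years within 1 of each other).
cnf = [[same("brand")], [within("year", 1)]]

# Stand-in for the LLM call: looks up a fixed set of true matches.
TRUE_MATCHES = {("a1", "b1"), ("a2", "b3")}


def llm_match(l: Record, r: Record) -> bool:
    return (l["id"], r["id"]) in TRUE_MATCHES


matches, llm_calls = fdj_style_join(left, right, cnf, llm_match)
print(f"LLM calls: {llm_calls} of {len(left) * len(right)} pairs")
```

In this toy setting the CNF keeps both true matches while the stub is called on 2 of the 6 pairs. In a real setting a poorly chosen clause or threshold could discard true matches, which is why the recall guarantee depends on how the thresholds are set.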

Paper Structure

This paper contains 43 sections, 10 theorems, 59 equations, 10 figures, 3 tables, 9 algorithms.

Key Result

theorem 1

MCFD is NP-hard.

Figures (10)

  • Figure 1: Example of a Featurized Decomposition
  • Figure 2: FDJ Example Workflow
  • Figure 3: Terminology Summary
  • Figure 4: Finding Featurized Decompositions
  • Figure 5: Candidate Featurization Generation Example
  • ...and 5 more figures

Theorems & Definitions (14)

  • definition 1: Approximate Semantic Joins with Guarantees
  • definition 2
  • theorem 1
  • theorem 2
  • lemma 1
  • theorem 3
  • lemma 2
  • proposition 1: Cost Complexity
  • definition 3: Set Cover
  • lemma 3
  • ...and 4 more