Nexus: Inferring Join Graphs from Metadata Alone via Iterative Low-Rank Matrix Completion

Tianji Cong, Yuanyuan Tian, Andreas Mueller, Rathijit Sen, Yeye He, Fotis Psallidas, Shaleen Deep, H. V. Jagadish

TL;DR

Nexus infers join graphs from metadata alone, a setting motivated by privacy constraints in lakehouse environments where access to data values is limited. The paper observes that real-world join graphs, represented as adjacency matrices, are sparse and low-rank, and formalizes join graph inference as a low-rank matrix completion (LRMC) problem. An Expectation-Maximization (EM) loop alternates between LRMC and LLM-based semantic typing to refine the probabilities of candidate joins. Nexus outperforms existing methods, including baselines that depend on data values, on four datasets, and a fast mode (Nexus-Fast) delivers up to 6x speedup with comparable results. Because it needs only metadata, the approach is practical in production settings with limited data access. Together, core submatrix optimization, LRMC, and LLM-guided EM give a scalable, privacy-preserving approach to automated data discovery and integration.

Abstract

Automatically inferring join relationships is a critical task for effective data discovery, integration, querying, and reuse. However, accurately and efficiently identifying these relationships in large and complex schemas can be challenging, especially in enterprise settings where access to data values is constrained. In this paper, we introduce the problem of join graph inference when only metadata is available. We conduct an empirical study on a large number of real-world schemas and observe that join graphs, when represented as adjacency matrices, exhibit two key properties: high sparsity and low-rank structure. Based on these novel observations, we formulate join graph inference as a low-rank matrix completion problem and propose Nexus, an end-to-end solution that uses only metadata. To further enhance accuracy, we propose a novel Expectation-Maximization algorithm that alternates between low-rank matrix completion and refining join candidate probabilities by leveraging Large Language Models. Our extensive experiments demonstrate that Nexus outperforms existing methods by a significant margin on four datasets, including a real-world production dataset. Additionally, Nexus can operate in a fast mode that provides comparable results with up to 6x speedup, offering a practical and efficient solution for real-world deployments.
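
To make the low-rank matrix completion framing concrete, the following is a minimal sketch and not the Nexus algorithm itself. It assumes a toy join graph matrix over hypothetical key columns of a bank schema, where two columns are marked joinable when they share a join key and the diagonal is set to 1; both the column names and these conventions are assumptions of this sketch, not definitions taken from the paper. The completion routine is generic iterative truncated-SVD imputation, swapped in here for illustration.

```python
import numpy as np

# Hypothetical key columns, grouped by the join key they share (illustrative only).
groups = {
    "account_id": ["account.account_id", "loan.account_id", "card.account_id"],
    "district_id": ["account.district_id", "district.district_id"],
    "client_id": ["client.client_id", "disp.client_id"],
}
columns = [c for cols in groups.values() for c in cols]
index = {c: i for i, c in enumerate(columns)}
n = len(columns)

# Toy join graph matrix: M[i, j] = 1 when columns i and j share a join key.
# Assumption of this sketch: the diagonal is 1 (a column trivially joins itself).
M = np.zeros((n, n))
for cols in groups.values():
    ids = [index[c] for c in cols]
    M[np.ix_(ids, ids)] = 1.0


def complete_low_rank(observed, mask, rank, n_iters=200):
    """Fill entries where `mask` is False by iterative rank-`rank` SVD imputation."""
    X = np.where(mask, observed, 0.0)  # start with unknown entries at 0
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep known entries fixed; overwrite only the unknown ones.
        X = np.where(mask, observed, low_rank)
    return X


# Hide one join (and its symmetric twin), then try to recover it.
i, j = index["account.account_id"], index["card.account_id"]
mask = np.ones((n, n), dtype=bool)
mask[i, j] = mask[j, i] = False

true_rank = np.linalg.matrix_rank(M)  # one rank-1 block per join key group
M_hat = complete_low_rank(M, mask, rank=true_rank)
recovered_score = M_hat[i, j]
```

In this toy setup the hidden entry is constrained by the block structure: the other observed joins in the `account_id` group only fit a rank-3 matrix if the missing entry is also close to 1. Real schemas are far less clean, which is presumably where the paper's LLM-refined join probabilities come in.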

Paper Structure

This paper contains 24 sections, 5 equations, 29 figures, 3 tables, 2 algorithms.

Figures (29)

  • Figure 1: An example schema of a bank database with its corresponding join graph and join graph matrix.
  • Figure 2: CDF of density and normalized rank of the join graph matrices for the curated real-world database schemas.
  • Figure 3: Overview of Nexus pipeline that infers join graphs from only metadata.
  • Figure 9: Comparison of F1 score on TPC-H and TPC-DS when data values are available.
  • Figure (unnumbered): TPC-H
  • ...and 24 more figures

Theorems & Definitions (5)

  • Definition 1: Join Graph
  • Definition 2: Join Graph Matrix
  • Definition 3: Join Graph Probability Matrix
  • Definition 4: Metadata-Only Join Graph Inference
  • Definition 5: Core Submatrix