Nexus: Inferring Join Graphs from Metadata Alone via Iterative Low-Rank Matrix Completion
Tianji Cong, Yuanyuan Tian, Andreas Mueller, Rathijit Sen, Yeye He, Fotis Psallidas, Shaleen Deep, H. V. Jagadish
TL;DR
Nexus infers join graphs from metadata alone, a setting motivated by privacy constraints in lakehouse environments where access to data values is constrained. It observes that real-world join graphs are sparse and low-rank, and formalizes join graph inference as a low-rank matrix completion (LRMC) problem, augmented with an Expectation-Maximization (EM) loop that uses LLM-based semantic typing to refine candidate joins. Nexus achieves state-of-the-art accuracy on four datasets, including a real-world production dataset, and its fast mode (Nexus-Fast) delivers comparable results with up to 6x speedup. Empirical results show that Nexus outperforms baselines that depend on data values, and its metadata-only design makes it applicable in production settings with limited data access. The combination of core submatrix optimization, LRMC, and LLM-guided EM offers a scalable, privacy-preserving solution for automated data discovery and integration.
Abstract
Automatically inferring join relationships is a critical task for effective data discovery, integration, querying, and reuse. However, identifying these relationships accurately and efficiently in large and complex schemas can be challenging, especially in enterprise settings where access to data values is constrained. In this paper, we introduce the problem of join graph inference when only metadata is available. We conduct an empirical study on a large number of real-world schemas and observe that join graphs, when represented as adjacency matrices, exhibit two key properties: high sparsity and low-rank structure. Based on these novel observations, we formulate join graph inference as a low-rank matrix completion problem and propose Nexus, an end-to-end solution that uses only metadata. To further enhance accuracy, we propose a novel Expectation-Maximization algorithm that alternates between low-rank matrix completion and refining join candidate probabilities by leveraging Large Language Models. Our extensive experiments demonstrate that Nexus outperforms existing methods by a significant margin on four datasets, including a real-world production dataset. Additionally, Nexus can operate in a fast mode that provides comparable results with up to 6x speedup, offering a practical and efficient solution for real-world deployments.
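To make the low-rank matrix completion formulation concrete, the following is a minimal sketch and not the Nexus algorithm itself. It assumes a toy join graph over seven columns in which columns sharing a key domain are mutually joinable, so the adjacency matrix is block-structured and has rank 2. Two join edges are hidden and then recovered by iterated truncated SVD (a generic "hard-impute" scheme chosen here for illustration). The function name `complete_low_rank`, the toy grouping, the known target rank, and the convention of setting the diagonal to 1 are all assumptions of this example.

```python
import numpy as np

def complete_low_rank(observed, mask, rank, n_iters=200):
    """Estimate unobserved entries by iterated rank-`rank` SVD truncation.

    observed: matrix whose entries are trusted only where mask is True.
    mask:     boolean matrix, True for observed entries.
    """
    filled = np.where(mask, observed, 0.0)  # start with zeros in the gaps
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep observed entries fixed; refresh only the unobserved ones.
        filled = np.where(mask, observed, low_rank)
    return low_rank

# Hypothetical toy schema: columns 0-3 share one key domain, columns 4-6 another.
groups = np.array([0, 0, 0, 0, 1, 1, 1])
truth = (groups[:, None] == groups[None, :]).astype(float)  # diagonal = 1 by convention

# Hide two join edges (symmetrically) to simulate unknown join relationships.
mask = np.ones(truth.shape, dtype=bool)
hidden_pairs = [(0, 3), (4, 6)]
for i, j in hidden_pairs:
    mask[i, j] = mask[j, i] = False

estimate = complete_low_rank(truth, mask, rank=2)
predicted = estimate > 0.5  # threshold scores into join / no-join decisions

for i, j in hidden_pairs:
    print(f"columns {i} and {j}: score={estimate[i, j]:.3f}, join={predicted[i, j]}")
```

In this toy setting the hidden edges are recoverable because the observed entries already pin down the two-block structure. A real schema would have far more unknown entries and no known rank, which is where the candidate probabilities and the LLM-guided refinement described in the abstract come in.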
