OpenForge: Probabilistic Metadata Integration
Tianji Cong, Fatemeh Nargesian, Junjie Xing, H. V. Jagadish
TL;DR
OpenForge addresses metadata integration by casting the resolution of relationships among metadata concepts as a maximum a posteriori (MAP) inference problem on a Markov Random Field (MRF), enforcing transitivity constraints via a shared ternary potential. It uses a two-stage approach: it first aggregates prior beliefs from prompted or fine-tuned LLMs and traditional ML models, then refines these predictions with MAP inference on an MRF conditioned on observed evidence $\mathcal{E}$ and axioms $\mathcal{A}$, yielding the relationship graph $\tilde{G}$ that maximizes $P(G|\mathcal{E},\mathcal{A})$. On three real-world datasets (SOTAB and Walmart-Amazon for equivalence; ICPSR for taxonomy), OpenForge consistently outperforms baselines, including GPT-4, by up to 25 F1 points, and remains scalable through GPU-accelerated inference and local MRF decomposition. These results demonstrate a practical, scalable method for unifying and maintaining heterogeneous metadata vocabularies, improving the findability and interoperability of data assets.
Abstract
Modern data stores increasingly rely on metadata to enable diverse activities such as data cataloging and search. However, metadata curation remains a labor-intensive task, and the broader challenge of metadata maintenance (ensuring its consistency, usefulness, and freshness) has been largely overlooked. In this work, we tackle the problem of resolving relationships among metadata concepts from disparate sources. These relationships are critical for creating clean, consistent, and up-to-date metadata repositories, and resolving them is a central challenge for metadata integration. We propose OpenForge, a two-stage prior-posterior framework for metadata integration. In the first stage, OpenForge exploits multiple methods, including fine-tuned large language models, to obtain prior beliefs about concept relationships. In the second stage, OpenForge refines these predictions using a Markov Random Field (MRF), a probabilistic graphical model. We formalize metadata integration as an optimization problem whose objective is to identify the relationship assignments that maximize their joint probability. The MRF formulation allows OpenForge to capture prior beliefs while encoding critical relationship properties, such as transitivity, in probabilistic inference. Experiments on real-world datasets demonstrate the effectiveness and efficiency of OpenForge. On a use case of matching two metadata vocabularies, OpenForge outperforms GPT-4, the second-best method, by 25 F1-score points.
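
To make the prior-posterior idea concrete, the following is a minimal sketch of MAP inference over three concept pairs with a transitivity-enforcing ternary potential. It is an illustration under stated assumptions, not OpenForge's implementation: the prior values, the concept names `a`, `b`, `c`, the constant `VIOLATION_POTENTIAL`, and the helper names are all hypothetical, and exhaustive enumeration stands in for the scalable inference the paper relies on.

```python
import itertools
import math

# Hypothetical stage-one priors: probability that each concept pair is
# equivalent. These numbers are illustrative, not taken from the paper.
priors = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.8,
    ("a", "c"): 0.4,
}

# Assumed potential value for assignments that violate transitivity.
VIOLATION_POTENTIAL = 1e-3


def ternary_potential(x_ab, x_bc, x_ac):
    """Penalize the case where exactly two of the three pairs are equivalent.

    For a symmetric equivalence relation over three concepts, two positive
    pairs imply the third, so a sum of exactly 2 is inconsistent.
    """
    return VIOLATION_POTENTIAL if x_ab + x_bc + x_ac == 2 else 1.0


def map_assignment(priors):
    """Return the joint 0/1 assignment with the highest unnormalized score.

    Brute-force enumeration, feasible only for a toy example like this one.
    """
    pairs = list(priors)
    best, best_score = None, -math.inf
    for values in itertools.product([0, 1], repeat=len(pairs)):
        # Log of the ternary potential shared by the three pair variables.
        score = math.log(ternary_potential(*values))
        # Add the log of each unary (prior) potential.
        for pair, value in zip(pairs, values):
            p = priors[pair]
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best, best_score = dict(zip(pairs, values)), score
    return best


# Thresholding each prior independently at 0.5 gives an inconsistent result.
independent = {pair: int(p > 0.5) for pair, p in priors.items()}
joint = map_assignment(priors)
```

With these illustrative priors, independent thresholding marks `(a, b)` and `(b, c)` as equivalent but `(a, c)` as not, which violates transitivity. The joint assignment instead marks all three pairs as equivalent, because the strong evidence for the first two pairs outweighs the weak prior against the third.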
